0. Overview (BST260 Group 11 Members)

BACKGROUND

COVID-19 has affected the entire world, sparing virtually no individual or country. The literature has pointed out that the pandemic impacted not only those infected with COVID-19 but also the healthy population. Previous questionnaire surveys have reported that obesity, depression, and anxiety increased in healthy people during the COVID-19 pandemic (1). Regional or nationwide lockdowns and changes in work arrangements reduced levels of physical activity in the healthy population. Since physical exercise is significantly associated with mental health and obesity, lack of exercise can increase the risk of developing obesity and depression.

Meanwhile, at the national and international level, concerns about viral transmission caused the cancellation or postponement of various events, including the 2020 Tokyo Summer Olympics. Originally planned for the summer of 2020, the Games were postponed in March 2020 and ultimately held in the summer of 2021. Athletes faced interrupted training routines as many countries implemented restrictive lockdowns and shut down gyms and athlete training centers. Some athletes reportedly contracted the virus or were even hospitalized. This motivated us to explore the association between the pandemic and athlete performance.

OBJECTIVE

The primary scientific goal of our project is to determine whether marathon records differed between the pre- and post-pandemic Olympic Games. Our secondary goal is to investigate whether countries with and without lockdown policies showed different trends in record changes between the two periods. Through these analyses, we can evaluate the association between pre-/post-pandemic status and each record, potentially shedding light on the pandemic’s impact on athletes’ preparedness for the Olympic Games. Specifically, we will create graphs using records from the 2016 and 2020 Olympic Games and compare several models for the association between pre-/post-pandemic status and changes in marathon records.

Throughout this process, we can learn how to create well-visualized graphs and how to build and compare different models in R. A potential benefit of the project is informing future support for athletes during the pandemic, especially those who cannot train sufficiently due to pandemic-related difficulties. To assess the pandemic’s impacts in detail, we will account not only for pre-/post-pandemic status but also for the existence of lockdown policies in each country as additional features.

1. Data Collection Part (Yusuke Yoshikawa)

Brief summary of scraping

First, I collected the tables of marathon records from the four Olympic marathon events (men’s and women’s races at Rio 2016 and Tokyo 2020). I further scraped the date of birth of each athlete by accessing their individual pages. The scraping process was similar across the four events, so see the descriptions under Men at Tokyo 2020 for the details of the data collection. I then combined the four datasets into one data frame. Second, I collected data on lockdown policies, COVID-19 cases, populations, and financial and geographic characteristics of the areas/countries the athletes represent. Finally, I saved the dataset as a CSV file for subsequent analyses by the other members. The columns are explained at the end of this section.

#### Libraries ####
library(tidyverse)
library(rvest)
library(lubridate)
#if (!require(RCurl)) {install.packages("RCurl", dependencies=TRUE)}
#library(RCurl)
#### Men at Tokyo 2020 ####
# URL of Men at Tokyo2020
url_20m <- "https://en.wikipedia.org/wiki/Athletics_at_the_2020_Summer_Olympics_%E2%80%93_Men%27s_marathon"

# Extract all tables in the page
tab <- read_html(url_20m) %>% 
  html_nodes("table") 
# Table of interest
dat_20m <- tab[[8]] %>% 
  html_table() 

# Data reshaping
colnames(dat_20m)[1:4] <- c("rank", "athlete", "country", "time")
dat_20m$rank[1:3] <- c(1:3)
dat_20m <- dat_20m %>% 
  mutate(rank = ifelse(rank=="—", NA, rank)) %>% 
  mutate(sb = ifelse(is.na(rank), NA, 0)) %>% 
  # sb = season best including "national record" and "personal best"
  mutate(sb = ifelse(Notes %in% c("SB","NR","PB"), 1, sb)) %>% 
  # dnf = did not finish including "did not start" and "disqualified"
  mutate(dnf = ifelse(Notes %in% c("DNF","DNS","DSQ"), 1, 0)) %>% 
  mutate(time = lubridate::hms(time)) %>% 
  select(rank, athlete, country, time, sb, dnf)

# Age data collection
urls <- read_html(url_20m) %>% 
  html_nodes("table") %>% 
  .[[8]] %>% 
  # extract all wiki URLs
  html_nodes("a[href *= '/wiki/' ]") %>% 
  html_attr("href") %>% 
  # keep only the URLs pointing to the athletes' personal pages
  .[!str_detect(.,"Olympics") & !str_detect(., "Bests") & !str_detect(., "conditions") & !str_detect(., "records")] %>%
  # save the URLs as character vectors
  paste("https://en.wikipedia.org", ., sep="")

# Extraction of Wiki text function
text_detect <- function(url){
  read_html(url) %>% 
    html_nodes("body") %>% 
    .[[1]] %>% 
    html_text()
}
# Apply the function to all athletes at this Olympic event
wiki_text <- sapply(urls, text_detect)

# DOB detecting function by string processing
dob_detect <- function(string){
  string %>% str_extract(
  "(born \\d{1,2} (January|February|March|April|May|June|July|August|September|October|November|December).\\d{4})|(born (January|February|March|April|May|June|July|August|September|October|November|December) \\d{1,2}. \\d{4})"
  )
}
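As a quick sanity check on the regex (using made-up biography snippets), `dob_detect()` captures both the day-first format used on most Wikipedia pages and the month-first American format:

```r
# Hypothetical examples; dob_detect() is defined above
dob_detect("Eliud Kipchoge (born 5 November 1984) is a Kenyan runner")
# "born 5 November 1984"
dob_detect("Jane Doe (born November 5, 1984) is a marathon runner")
# "born November 5, 1984"
```

The single `.` in each alternative of the pattern absorbs the space or comma between the date parts, which is why both layouts match.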
# Apply the function to all athletes at this Olympic event
dobs <- sapply(wiki_text, dob_detect) %>% 
  as.character() %>% 
  str_replace("born ", "")
dobs_d <- dobs %>% str_extract("\\d{1,2}")
dobs_m <- dobs %>% str_extract("(January|February|March|April|May|June|July|August|September|October|November|December)")
dobs_y <- dobs %>% str_extract("\\d{4}")
dobs <- paste(dobs_y, dobs_m, dobs_d, sep="-") %>% ymd()
dat_20m$dob <- dobs

# Calculate age at the Olympic game
if (!require(eeptools)) {install.packages("eeptools", dependencies=TRUE)}
library(eeptools)
dat_20m <- dat_20m %>% 
  mutate(age = eeptools::age_calc(dob, ymd("2021-08-08"), units = "years") %>% floor)

# Add sex category 
dat_20m$sex <- "Men"
# Add game label
dat_20m$olympic <- "Tokyo2020"
#### Women at Tokyo 2020 ####
# URL of Women at Tokyo 2020
url_20w <- "https://en.wikipedia.org/wiki/Athletics_at_the_2020_Summer_Olympics_%E2%80%93_Women%27s_marathon"

# Extract all tables in the page
tab <- read_html(url_20w) %>% 
  html_nodes("table") 
# Table of interest
dat_20w <- tab[[8]] %>% 
  html_table()

# Data reshaping
colnames(dat_20w)[1:4] <- c("rank", "athlete", "country", "time")
dat_20w$rank[1:3] <- c(1:3)
dat_20w <- dat_20w %>% 
  mutate(rank = ifelse(rank=="–", NA, rank)) %>% 
  mutate(sb = ifelse(is.na(rank), NA, 0)) %>% 
  mutate(sb = ifelse(Notes %in% c("SB","NR","PB"), 1, sb)) %>% 
  mutate(dnf = ifelse(Notes %in% c("DNF","DNS","DSQ"), 1, 0)) %>% 
  mutate(time = lubridate::hms(time)) %>% 
  select(rank, athlete, country, time, sb, dnf)

# Age data collection
urls <- read_html(url_20w) %>% 
  html_nodes("table") %>% 
  .[[8]] %>% 
  html_nodes("a[href *= '/wiki/' ]") %>% # extract the individual athletes' wikis
  html_attr("href") %>% 
  .[!str_detect(.,"Olympics") & !str_detect(., "Bests") & !str_detect(., "conditions") & !str_detect(., "records")] %>%
  paste("https://en.wikipedia.org", ., sep="")

# Extraction of Wiki text by the function above
wiki_text <- sapply(urls, text_detect)

# DOB detection by string processing
dobs <- sapply(wiki_text, dob_detect) %>% 
  as.character() %>% 
  str_replace("born ", "")
dobs_d <- dobs %>% str_extract("\\d{1,2}")
dobs_m <- dobs %>% str_extract("(January|February|March|April|May|June|July|August|September|October|November|December)")
dobs_y <- dobs %>% str_extract("\\d{4}")
dobs <- paste(dobs_y, dobs_m, dobs_d, sep="-") %>% ymd()
dat_20w$dob <- dobs

# Calculate age at Tokyo 2020
dat_20w <- dat_20w %>% 
  mutate(age = eeptools::age_calc(dob, ymd("2021-08-07"), units = "years") %>% floor)

# Sex category 
dat_20w$sex <- "Women"
# Game label
dat_20w$olympic <- "Tokyo2020"
#### Men at Rio 2016 ####
# URL of Men at Rio 2016
url_16m <- "https://en.wikipedia.org/wiki/Athletics_at_the_2016_Summer_Olympics_%E2%80%93_Men%27s_marathon"

# Extract all tables in the page
tab <- read_html(url_16m) %>% 
  html_nodes("table") 
# Table of interest
dat_16m <- tab[[6]] %>% 
  html_table() 

# Data reshaping
colnames(dat_16m)[1:4] <- c("rank", "athlete", "country", "time")
dat_16m$rank[1:3] <- c(1:3)
dat_16m <- dat_16m %>% 
  mutate(rank = ifelse(rank=="—", NA, rank)) %>% 
  mutate(sb = ifelse(is.na(rank), NA, 0)) %>% 
  mutate(sb = ifelse(Notes %in% c("SB","NR","PB"), 1, sb)) %>% 
  mutate(dnf = ifelse(time=="DNF" | Notes %in% c("DNF","DNS","DSQ"), 1, 0)) %>% 
  mutate(time = ifelse(dnf==1, NA, time)) %>% 
  mutate(time = lubridate::hms(time)) %>% 
  select(rank, athlete, country, time, sb, dnf)

# Age data collection
urls <- read_html(url_16m) %>% 
  html_nodes("table") %>% 
  .[[6]] %>% 
  html_nodes("a[href *= '/wiki/' ]") %>% # extract the individual athletes' wikis
  html_attr("href") %>% 
  .[!str_detect(.,"Olympics") & !str_detect(., "Bests") & !str_detect(., "conditions") & !str_detect(., "records")] %>%
  paste("https://en.wikipedia.org", ., sep="")

# Extraction of Wiki text
wiki_text <- sapply(urls, text_detect)

# DOB detection by string processing
dobs <- sapply(wiki_text, dob_detect) %>% 
  as.character() %>% 
  str_replace("born ", "")
dobs_d <- dobs %>% str_extract("\\d{1,2}")
dobs_m <- dobs %>% str_extract("(January|February|March|April|May|June|July|August|September|October|November|December)")
dobs_y <- dobs %>% str_extract("\\d{4}")
dobs <- paste(dobs_y, dobs_m, dobs_d, sep="-") %>% ymd()
dat_16m$dob <- dobs

# Calculate age at Rio 2016
dat_16m <- dat_16m %>% 
  mutate(age = eeptools::age_calc(dob, ymd("2016-08-21"), units = "years") %>% floor)

# Sex category 
dat_16m$sex <- "Men"
# Game label
dat_16m$olympic <- "Rio2016"
#### Women at Rio 2016 ####
# URL of Women at Rio 2016
url_16w <- "https://en.wikipedia.org/wiki/Athletics_at_the_2016_Summer_Olympics_%E2%80%93_Women%27s_marathon"

# Extract all tables in the page
tab <- read_html(url_16w) %>% 
  html_nodes("table") 
# Table of interest
dat_16w <- tab[[6]] %>% 
  html_table() 

# Data reshaping
colnames(dat_16w)[1:4] <- c("rank", "athlete", "country", "time")
dat_16w$rank[1:3] <- c(1:3)
dat_16w <- dat_16w %>% 
  mutate(rank = ifelse(rank=="—", NA, rank)) %>% 
  mutate(sb = ifelse(is.na(rank), NA, 0)) %>% 
  mutate(sb = ifelse(Notes %in% c("SB","NR","PB"), 1, sb)) %>% 
  mutate(dnf = ifelse(time=="DNF" | Notes %in% c("DNF","DNS","DSQ"), 1, 0)) %>% 
  mutate(time = ifelse(dnf==1, NA, time)) %>% 
  mutate(time = lubridate::hms(time)) %>% 
  select(rank, athlete, country, time, sb, dnf)

# Age data collection
urls <- read_html(url_16w) %>% 
  html_nodes("table") %>% 
  .[[6]] %>% 
  html_nodes("a[href *= '/wiki/' ]") %>% # extract the individual athletes' wikis
  html_attr("href") %>% 
  .[!str_detect(.,"Olympics") & !str_detect(., "Bests") & !str_detect(., "conditions") & !str_detect(., "records")] %>%
  paste("https://en.wikipedia.org", ., sep="")

# Extraction of Wiki text
wiki_text <- sapply(urls, text_detect)

# DOB detection by string processing
dobs <- sapply(wiki_text, dob_detect) %>% 
  as.character() %>% 
  str_replace("born ", "")
dobs_d <- dobs %>% str_extract("\\d{1,2}")
dobs_m <- dobs %>% str_extract("(January|February|March|April|May|June|July|August|September|October|November|December)")
dobs_y <- dobs %>% str_extract("\\d{4}")
dobs <- paste(dobs_y, dobs_m, dobs_d, sep="-") %>% ymd()
dobs[128] <- "1980-10-12" %>% ymd()   # missing data (source: https://en.wikipedia.org/wiki/Graciete_Santana)
dat_16w$dob <- dobs

# Calculate age at Rio 2016
dat_16w <- dat_16w %>% 
  mutate(age = eeptools::age_calc(dob, ymd("2016-08-14"), units = "years") %>% floor)

# Sex category 
dat_16w$sex <- "Women"
# Game label
dat_16w$olympic <- "Rio2016"
#### Combine the 4 dataframes ####
dat <- rbind(dat_20m, dat_20w, dat_16m, dat_16w)

#### Rename the areas/countries ####
dat <- dat %>% 
  mutate(country = ifelse(country=="Chinese Taipei", "Taiwan", country)) %>% 
  mutate(country = ifelse(country=="Democratic Republic of the Congo", "Congo (Kinshasa)", country))
#### Lockdown area/countries ####
# Source: https://en.m.wikipedia.org/wiki/COVID-19_lockdowns
non_lockdown <- c("Burundi", "Iceland", "Japan", "Nicaragua",
                  "South Korea", "Sweden", "Taiwan", "Tanzania", "Uruguay")
dat <- dat %>% 
  mutate(lockdown = ifelse(country %in% non_lockdown, 0, 1))
#### Attendance before Tokyo2020 ####
prior_attend <- dat %>% 
  group_by(athlete) %>% 
  filter(n()>1) %>% 
  pull(athlete)
dat <- dat %>% 
  mutate(prior_attend = ifelse(olympic=="Tokyo2020" & athlete %in% prior_attend, 1, 0))
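The `group_by()`/`filter(n() > 1)` step above keeps every athlete whose name appears more than once across the combined games. A minimal self-contained illustration with made-up names:

```r
library(dplyr)

toy <- tibble(
  athlete = c("A", "B", "A", "C"),
  olympic = c("Rio2016", "Rio2016", "Tokyo2020", "Tokyo2020")
)
# names appearing in both games (duplicates in the result are harmless
# because %in% is used for the lookup)
repeaters <- toy %>% group_by(athlete) %>% filter(n() > 1) %>% pull(athlete)
toy %>% mutate(prior_attend = ifelse(olympic == "Tokyo2020" & athlete %in% repeaters, 1, 0))
# prior_attend is 1 only for athlete "A" in Tokyo2020
```

Note that this matches on athlete names, so it assumes no two different athletes share a name across the games.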
#### COVID-19 cases as of July 22, 2021 ####
# COVID-19 cases collected from [Github](https://github.com/CSSEGISandData/COVID-19).
#url <- getURL("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv")

url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
jhu <- read_csv(url) %>%
  # cases are cumulative, so keep only "7/22/21", the start date of Tokyo2020
  dplyr::select(`Province/State`, `Country/Region`, `7/22/21`) %>% 
  # keep the region names consistent with the other data
  mutate(`Country/Region2` = ifelse(is.na(`Province/State`), `Country/Region`,
                                    ifelse(`Province/State`=="Hong Kong", "Hong Kong",
                                           `Country/Region`))) %>% 
  group_by(`Country/Region2`) %>% 
  summarise(case_total = sum(`7/22/21`), .groups="drop") %>% 
  rename(country = `Country/Region2`) %>% 
  dplyr::select(country, case_total)
#### Population and GDP ####

## Population ##
# Read in Johns Hopkins UID lookup table
# Source: https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv
url <- "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv"
pop <- read_csv(url)

## GDP data ##
# Read in GDP data on WORLD BANK
# Source: https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv
url <- "https://api.worldbank.org/v2/en/indicator/NY.GDP.MKTP.CD?downloadformat=csv"
# the endpoint above serves a zip archive, so the CSV inside was downloaded manually
gdp <- read_csv("API_NY.GDP.MKTP.CD_DS2_en_csv_v2_3263806.csv")

# Join the population and the GDP data
temp <- left_join(pop %>% subset(is.na(Province_State)), 
                  gdp %>% 
                    rename(iso3 = `Country Code`,
                           gdp2016 = `2016`,
                           gdp2017 = `2017`,
                           gdp2018 = `2018`,
                           gdp2019 = `2019`,
                           gdp2020 = `2020`) %>% 
                    dplyr::select(iso3, `Country Name`, gdp2016:gdp2020),
                  by = "iso3") # combine the dataframes according to iso3
#### Combine the jhu data with population and GDP ####
jhu <- left_join(jhu, 
                 temp %>% rename(country = Country_Region), 
                 # join the dataframes by consistent country names
                 by="country") 

jhu <- jhu %>% 
  # calculate COVID-19 case per population
  mutate(case_pp = case_total/Population) %>% 
  # select only area/country names, case per population and GDP data
  dplyr::select(country, case_pp, gdp2016:gdp2020)

# Reclassify discrepant area/country names
jhu <- jhu %>% 
  mutate(country = ifelse(country %in% dat$country, country, 
                          case_when(country == "Taiwan*" ~ "Taiwan",
                                    country == "Czechia" ~ "Czech Republic",
                                    country == "Democratic Republic of the Congo" ~ "Congo",
                                    country == "United Kingdom" ~ "Great Britain",
                                    country == "Korea, South" ~ "South Korea",
                                    country == "US" ~ "United States")))
#### join the datasets: dat+(jhu+pop+gdp) ####
# No case data: North Korea, Palestine, Puerto Rico, Refugee Olympic Team
dat <- left_join(dat, jhu, by="country")
#### Continent ####
# URL of continent category
url <- "https://www.newworldencyclopedia.org/entry/list_of_countries_by_continent"

# Extract all tables in the page
tab <- read_html(url) %>% 
  html_nodes("table") 
# Africa
africa <- tab[[1]] %>% 
  html_table() %>% 
  .[,c(1,3)]
africa <- paste0(africa[,1], africa[,2], "South Sudan") # add discrepant country name
# Asia
asia <- tab[[2]] %>% 
  html_table() %>% 
  .[,c(1,3)]
asia <- paste0(asia[,1], asia[,2], "Taiwan") # add discrepant country name
# Europe
europe <- tab[[3]] %>% 
  html_table() %>% 
  .[,c(1,3)]
europe <- paste0(europe[,1], europe[,2], "Great Britain") # add discrepant country name
# North America
n_america <- tab[[4]] %>% 
  html_table() %>% 
  .[,c(1,3)]
n_america <- paste0(n_america[,1], n_america[,2])
# South America
s_america <- tab[[5]] %>% 
  html_table() %>% 
  .[,c(1,3)]
s_america <- paste0(s_america[,1], s_america[,2])
# Oceania
oceania <- tab[[6]] %>% 
  html_table() %>% 
  .[,c(1,3)]
oceania <- paste0(oceania[,1], oceania[,2])

# add "continent" column according to the continent categories obtained above
dat <- dat %>% 
  mutate(continent = case_when(str_detect(africa, country) ~ "Africa",
                               country=="Congo (Kinshasa)" ~ "Africa", # add discrepant country name
                               str_detect(asia, country) ~ "Asia",
                               str_detect(europe, country) ~ "Europe",
                               str_detect(n_america, country) ~ "North America",
                               str_detect(s_america, country) ~ "South America",
                               str_detect(oceania, country) ~ "Oceania"))
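A note on the `str_detect()` calls above: the `paste0()` calls collapse each continent table into a single long string of country names, so `str_detect()` is used "haystack-first" — the one continent string is the string argument, and each athlete's country is the pattern searched for inside it. A minimal sketch of the idea, with a made-up blob:

```r
library(stringr)

# one long string of country names, as produced by the paste0() calls above
africa_blob <- "AlgeriaAngolaKenyaEthiopiaSouth Sudan"
countries   <- c("Kenya", "Japan", "Ethiopia")
str_detect(africa_blob, countries)
# TRUE FALSE TRUE
```

One caveat of this approach is substring collisions (e.g. "Niger" also matches inside "Nigeria"), so it relies on no ambiguous country pair being split across continents.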
#### Final data set ####
# time conversion into seconds for analyses
dat <- dat %>% 
  mutate(time_sec = period_to_seconds(time)) %>% 
  select(rank:time, time_sec, sb:last_col())

## Saving the dataframe
write_csv(dat, "data.csv")
#saveRDS(dat, "data.RData")

## Cleaning the environment
rm(list = ls()[!ls()=="dat"])

## Data
dat %>% slice(1:8) %>% knitr::kable()
| rank | athlete | country | time | time_sec | sb | dnf | dob | age | sex | olympic | lockdown | prior_attend | case_pp | gdp2016 | gdp2017 | gdp2018 | gdp2019 | gdp2020 | continent |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Eliud Kipchoge | Kenya | 2H 8M 38S | 7718 | 0 | 0 | 1984-11-05 | 36 | Men | Tokyo2020 | 1 | 1 | 0.0036285 | 6.918876e+10 | 7.896500e+10 | 8.777858e+10 | 9.550309e+10 | 9.884294e+10 | Africa |
| 2 | Abdi Nageeye | Netherlands | 2H 9M 58S | 7798 | 1 | 0 | 1989-03-02 | 32 | Men | Tokyo2020 | 1 | 1 | 0.1083166 | 7.835280e+11 | 8.318100e+11 | 9.135970e+11 | 9.101940e+11 | 9.138650e+11 | Europe |
| 3 | Bashir Abdi | Belgium | 2H 10M 0S | 7800 | 1 | 0 | 1989-02-10 | 32 | Men | Tokyo2020 | 1 | 0 | 0.0967716 | 4.757400e+11 | 5.015230e+11 | 5.434110e+11 | 5.332550e+11 | 5.153320e+11 | Europe |
| 4 | Lawrence Cherono | Kenya | 2H 10M 2S | 7802 | 1 | 0 | 1988-08-07 | 33 | Men | Tokyo2020 | 1 | 0 | 0.0036285 | 6.918876e+10 | 7.896500e+10 | 8.777858e+10 | 9.550309e+10 | 9.884294e+10 | Africa |
| 5 | Ayad Lamdassem | Spain | 2H 10M 16S | 7816 | 1 | 0 | 1981-10-11 | 39 | Men | Tokyo2020 | 1 | 0 | 0.0908839 | 1.232080e+12 | 1.309300e+12 | 1.420300e+12 | 1.393050e+12 | 1.281480e+12 | Europe |
| 6 | Suguru Osako | Japan | 2H 10M 41S | 7841 | 1 | 0 | 1991-05-23 | 30 | Men | Tokyo2020 | 0 | 0 | 0.0067817 | 5.003680e+12 | 4.930840e+12 | 5.036890e+12 | 5.148780e+12 | 4.975420e+12 | Asia |
| 7 | Alphonce Simbu | Tanzania | 2H 11M 35S | 7895 | 1 | 0 | 1992-02-14 | 29 | Men | Tokyo2020 | 0 | 1 | 0.0000085 | 4.977402e+10 | 5.332063e+10 | 5.700371e+10 | 6.113687e+10 | 6.240975e+10 | Africa |
| 8 | Galen Rupp | United States | 2H 11M 41S | 7901 | 1 | 0 | 1986-05-08 | 35 | Men | Tokyo2020 | 1 | 1 | 0.1043182 | 1.874510e+13 | 1.954300e+13 | 2.061190e+13 | 2.143320e+13 | 2.093660e+13 | North America |

Data information

  • rank: rank at each Olympic game
  • athlete: athlete names
  • country: area or country the athlete represents
  • time: records at each Olympic game
  • time_sec: records at each Olympic game in seconds
  • sb: season best of the athlete; including national record and personal best
  • dnf: did not finish; including did not start and disqualified
  • dob: date of birth
  • age: age (years) at the time of each Olympic game
  • sex: sex category
  • olympic: Olympic game name
  • case_pp: total cases per population in the athlete’s country as of July 22, 2021 (start date of Tokyo 2020)
  • gdp2016:gdp2020: GDP (current US dollars) from 2016 to 2020
  • continent: continent of the athlete’s country
  • lockdown: 1 if the athlete’s country implemented a lockdown, 0 otherwise
  • prior_attend: attendance before Tokyo 2020 (all Rio 2016 athletes are coded 0)

2. Visualization to Explore the Impact of the COVID-19 Pandemic on Olympic Marathon Performances (Marie Wu)

After we defined our variables and completed our collection of marathon results from the Rio 2016 and Tokyo 2020 games, we further explored the basic composition of our data. We also examined how the distribution of athletes’ marathon finishing times compared between 2016 and 2020, as well as the impact of the COVID-19 pandemic on finishing times.

The results of attendance by gender and by continent across 2016 and 2020 are shown in Graph 1 and Graph 2, respectively. The distributions of finishing times at the Rio and Tokyo games for both genders are shown in Graph 3 and Graph 4.

The marathon results at the Rio and Tokyo games for both genders were compared by continent in Graph 5 and Graph 6. Graph 7A and 7B show the scatterplots for eyeballing men’s marathon results versus GDP across 2016 and 2020; likewise, Graph 8A and 8B show the scatterplots for eyeballing women’s marathon results versus GDP across 2016 and 2020. In Graphs 9 and 10, we explored the relationship between top finishing athletes and COVID-19 severity within the countries they represent via scatterplots and bar charts.

Graphs 11 and 12 allow us to view the countries from which the top 20 marathon athletes in both genders originate on COVID-19 heatmaps across 2016 and 2020.

library(dplyr)
library(tidyverse)
library(gridExtra)
library(dslabs)
library(ggplot2)
library(ggthemes)
library(scales)
library(lubridate)
library(ggpubr)

# Read in the dataset saved in Section 1
dat <- read_csv("data.csv")

# Subset data frames by year and sex
dat_16m <- dat %>% filter(olympic == "Rio2016", sex=="Men")
dat_16f <- dat %>% filter(olympic == "Rio2016", sex=="Women")
dat_20m <- dat %>% filter(olympic == "Tokyo2020", sex=="Men")
dat_20f <- dat %>% filter(olympic == "Tokyo2020", sex=="Women")
dat_Male<- dat %>% filter(sex=="Men")
dat_Female<- dat %>% filter(sex=="Women")

Graph 1. Bar Chart - Athlete Attendance by Gender, 2016 vs. 2020

# Graph 1. Bar Chart - Athlete Attendance by Gender, 2016 vs. 2020

dat %>% ggplot(aes(x =  sex, fill= olympic)) +
  geom_bar(position = position_dodge2(), aes(group = olympic)) + 
  xlab(" ")+ 
  ylab("Attendance") + 
  ggtitle("Athlete Attendance by Gender, 2016 vs. 2020") +
  geom_text(stat = "count", aes(label = ..count..), position = position_dodge2(width = .9), size = 2.5)

From this bar chart, we can see that marathon attendance at Tokyo 2020 decreased relative to Rio 2016, both in total and for each gender.

At the Rio games, attendance was 155 men and 157 women; at the Tokyo games, 106 men and 88 women. The overall decrease was 118 athletes: 49 men and 69 women.
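The decreases quoted above follow from a little arithmetic on the reported counts (a self-contained sketch; in the analysis itself, `dat %>% count(olympic, sex)` should give the same attendance figures):

```r
library(dplyr)

# attendance figures read off Graph 1
attendance <- tibble(
  sex   = c("Men", "Women"),
  rio   = c(155, 157),
  tokyo = c(106, 88)
) %>%
  mutate(decrease = rio - tokyo)

attendance$decrease       # 49 69
sum(attendance$decrease)  # 118
```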

We further explored whether the decrease in athlete attendance was different by continent.

This graph does not include the Refugee Olympic Team.

Graph 2. Bar Chart - Athlete Attendance by Continent, 2016 vs. 2020

# Graph 2.  Bar Chart - Athlete Attendance by Continent, 2016 vs. 2020

dat %>% filter(!is.na(continent))%>%
  ggplot(aes(x = continent, fill= olympic)) +
  geom_bar(position = position_dodge2(), aes(group = olympic)) +
  xlab(" ")+ 
  ylab("Attendance") + 
  ggtitle("Athlete Attendance by Continent, 2016 vs. 2020") +
  geom_text(stat = "count", aes(label = ..count..), position = position_dodge2(width = .9), size = 2.5)

From this bar chart, we can see that for most continents, including Africa, Asia, Europe, and South America, athlete attendance in Tokyo decreased by 17–35 athletes compared to Rio. There was no change in attendance for North America, and a slight increase for Oceania.

Graph does not include athletes from the Refugee Olympic Team.

Graph 3. Histogram and Boxplot - Men’s Marathon Results, Rio 2016 vs. Tokyo 2020

# Graph 3. Histogram and Boxplot - Men's Marathon Results, Rio 2016 vs. Tokyo 2020

#a Box plot with jittered points
ps_Male <- dat_Male %>% filter(!is.na(continent))%>%
  ggplot(aes(olympic, time)) + 
  geom_boxplot(coef=3, color = "black") + 
  scale_y_time()+
  geom_jitter(width = 0.1, alpha = 0.2, color = "coral") +
  ylab("Finishing Time") + xlab(" ") 

ps_Female <- dat_Female %>% filter(!is.na(continent))%>%
  ggplot(aes(olympic, time)) + 
  geom_boxplot(coef=3, color = "black") + 
  geom_jitter(width = 0.1, alpha = 0.2, color = "#00AFBB") +
  scale_y_time()+
  ylab("Finishing Time") + xlab(" ")

#b Histogram
ph_Male <- dat_Male  %>% filter(!is.na(continent)) %>%
  ggplot(aes(time, ..density..)) +
  geom_histogram(binwidth = 70, color = "black") +
  scale_x_time()+
  facet_grid(olympic~., scales = "fixed") + 
  xlab(" ")

ph_Female <- dat_Female %>% filter(!is.na(continent)) %>%
  ggplot(aes(time, ..density..)) +
  geom_histogram(binwidth = 70, color="black") +
  scale_x_time()+
  facet_grid(olympic~., scales = "fixed")+ 
  xlab(" ")

#c Combined grid
SBH_male <- grid.arrange(ps_Male, ph_Male, ncol = 2, top = "Men's Marathon Results, Rio 2016 vs. Tokyo 2020" )

On the left, the y-axis of the box plot shows marathon finishing time; the lower the finishing time, the better the rank. The 25th percentile, median, and 75th percentile finishing times all improved.

On the right, the stacked histograms show the frequency distribution of athlete finishing time.

Graph 4. Histogram and Boxplot - Women’s Marathon Results, Rio 2016 vs. Tokyo 2020

# Graph 4. Histogram and Boxplot - Women's Marathon Results, Rio 2016 vs. Tokyo 2020

SBH_female <- grid.arrange(ps_Female, ph_Female, ncol = 2, top = "Women's Marathon Results, Rio 2016 vs. Tokyo 2020" )

On the left, the y-axis of the box plot shows marathon finishing time; the lower the finishing time, the better the rank. The 25th percentile, median, and 75th percentile finishing times all improved.

On the right, the stacked histograms show the frequency distribution of athlete finishing time.

Graph 5. Box Plot - Men’s Marathon Results by Continent, Rio 2016 vs. Tokyo 2020

# Graph 5. Box Plot - Men's Marathon Results by Continent, Rio 2016 vs. Tokyo 2020

dat_16m_pc <- dat_16m %>% select(country, time, olympic, gdp2016, continent, case_pp) %>% rename(`gdp`=`gdp2016`)
dat_20m_pc <- dat_20m %>% select(country, time, olympic, gdp2020, continent, case_pp) %>% rename(`gdp`=`gdp2020`)  

male_join <- full_join(dat_16m_pc, dat_20m_pc)

dat_gdp_male <- male_join %>% filter(!is.na(continent)) %>%
  ggplot(aes(continent, time, fill = olympic)) +
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_time(breaks = seq(6000,12000,1200)) + 
  ylab("Finishing Time") + xlab(" ")+ 
  ggtitle("Men's Marathon Results by Continent, Rio 2016 vs. Tokyo 2020 ")
dat_gdp_male 

From this box plot, we can see that, for athletes from Africa, Asia, Europe, and South America, men’s 25th percentile, median, and 75th percentile finishing times improved in Tokyo 2020 relative to Rio 2016.

Conversely, for athletes from both North America and Oceania, median finishing times regressed.

Graph 6. Box Plot - Women’s Marathon Results by Continent, Rio 2016 vs. Tokyo 2020

# Graph 6. Box Plot - Women's Marathon Results by Continent, Rio 2016 vs. Tokyo 2020
dat_16f_pc <- dat_16f %>% select(country, time, olympic, gdp2016, continent, case_pp) %>% rename(`gdp`=`gdp2016`)
dat_20f_pc <- dat_20f %>% select(country, time, olympic, gdp2020, continent, case_pp) %>% rename(`gdp`=`gdp2020`)  

female_join <- full_join(dat_16f_pc, dat_20f_pc)

dat_gdp_female <- female_join %>% filter(!is.na(continent)) %>%
  ggplot(aes(continent, time, fill = olympic)) +
  geom_boxplot() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_time(breaks = seq(6000,12000,1200)) + 
  ylab("Finishing Time") + xlab(" ")+ 
  ggtitle("Women's Marathon Results by Continent, Rio 2016 vs. Tokyo 2020")

dat_gdp_female

From this box plot, we can see that for athletes from Africa, Asia, Europe, and South America, women’s finishing times improved in Tokyo 2020 compared to Rio 2016, similar to the men’s results. Times from North America also improved, while in Oceania median finishing times regressed.

Graph 7. Scatterplots: Men’s Marathon Results by GDP in 2016 and 2020

# Graph 7. Scatterplots: Men's Marathon Results by GDP in 2016 and 2020

# 7.A Scatterplots: Men's Marathon Results by GDP in 2016
p_16mgdp <- dat_16m %>% filter(!is.na(continent)) %>% 
  ggplot(aes(gdp2016/1000000000, time, col = continent)) + 
  geom_point(alpha = 0.5)  + 
  geom_hline(yintercept = 8448, color = "blue") +  # upper-half cutoff
  geom_hline(yintercept = 7951, color = "coral") + # top-10 cutoff
  scale_x_continuous(trans = "log10", labels = comma) + 
  scale_y_time()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("GDP for 2016 (billion USD)")+
  ylab("Finishing Time") +
  geom_label(aes(4,8447,label = "Upper Half" , vjust = -0.3), color="blue") +
  geom_label(aes(3,7950,label ="Top 10" , vjust = -0.3), color="coral" ) +
  ggtitle("Men's Marathon Results by GDP in 2016")

Graph 7.A Men’s Marathon Results by GDP in 2016

p_16mgdp

# 7.B Scatterplots: Men's Marathon Results by GDP in 2020

p_20mgdp <- dat_20m %>% filter(!is.na(continent)) %>% 
  ggplot(aes(gdp2020/1000000000, time, col = continent)) + 
  geom_point(alpha = 0.5)  + 
  geom_hline(yintercept = 8240, color = "blue") +  # upper-half cutoff
  geom_hline(yintercept = 7934, color = "coral") + # top-10 cutoff
  scale_x_continuous(trans = "log10", labels = comma) + 
  scale_y_time() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("GDP for 2020 (billion USD)")+
  ylab("Finishing Time") +
  geom_label(aes(5,8239,label = "Upper Half" , vjust = -0.3), color="blue") +
  geom_label(aes(3,7933,label ="Top 10" , vjust = -0.3), color="coral" ) +
  ggtitle("Men's Marathon Results by GDP in 2020")

Graph 7.B Men’s Marathon Results by GDP in 2020

p_20mgdp

The two scatterplots above let us eyeball the relationship between the GDP of each athlete’s country and finishing time.

Here, the x-axis shows the GDP of the country each athlete represents and the y-axis shows finishing time; each point represents an athlete who finished. At the Rio Olympics, 155 male athletes entered the marathon, and 139 finished. Athletes in the upper half completed the race within 8447 seconds (2 hrs 20 mins 47 secs); the top 10 finished within 7949 seconds (2 hrs 12 mins 29 secs).

At the Tokyo Olympics, 106 male athletes entered the marathon, and 76 finished. Athletes in the upper half completed the race within 8239 seconds (2 hrs 17 mins 19 secs); the top 10 finished within 7933 seconds (2 hrs 12 mins 13 secs).

Since the athletes who attended the Rio and Tokyo games were mostly different, and since inflation may have affected GDP between 2016 and 2020, we used separate scatterplots to compare the two years. Comparing the 2016 and 2020 men’s scatterplots, the 2016 distribution is more evenly spread along the x-axis. The GDPs of the athletes’ countries were higher in 2020, both among all male marathon athletes and among those who finished in the upper half.
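The upper-half and top-10 cutoffs quoted above can be recomputed from the finishing times. Below is a minimal helper (hypothetical name `cutoffs`, not part of the analysis code), demonstrated on a made-up vector; applied to `dat_16m$time_sec` it should reproduce the Rio figures, up to the convention used for "upper half":

```r
# upper-half cutoff = slowest time within the faster half of finishers;
# top-10 cutoff = 10th fastest time (NA = did not finish)
cutoffs <- function(time_sec) {
  finished <- sort(time_sec[!is.na(time_sec)])
  list(upper_half = finished[ceiling(length(finished) / 2)],
       top10      = finished[10])
}

# toy data: 12 finishers at 60-second intervals from 7200 s
toy <- seq(7200, by = 60, length.out = 12)
cutoffs(toy)$upper_half  # 7500 (6th fastest of 12)
cutoffs(toy)$top10       # 7740
```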

Graph 8. Scatterplots: Women’s Marathon Results by GDP in 2016 and 2020

## Graph 8. Scatterplots: Women's Marathon Results by GDP in 2016 and 2020

# 8.A Scatterplots: Women's Marathon Results by GDP in 2016 

p_16fgdp <- dat_16f %>% filter(!is.na(continent)) %>% 
  ggplot(aes(gdp2016/1000000000, time, col = continent)) + 
  geom_point(alpha = 0.5)  + 
  geom_hline(yintercept=9749, color="blue") +
  geom_hline(yintercept=8917, color="coral") +
  scale_x_continuous(trans = "log10", labels = comma) + 
  scale_y_time() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("GDP for 2016 (billion USD)")+
  ylab("Finishing Time") +
  geom_label(aes(7,9748,label = "Upper Half" , vjust = -0.3), color="blue") +
  geom_label(aes(6,8916,label ="Top 10" , vjust = -0.3), color="coral" ) +
  ggtitle("Women's Marathon Results by GDP in 2016")

Graph 8.A Women’s Marathon Results by GDP in 2016

p_16fgdp

# 8.B Scatterplots: Women's Marathon Results by GDP in 2020

p_20fgdp <- dat_20f %>% filter(!is.na(continent)) %>% 
  ggplot(aes(gdp2020/1000000000, time, col = continent)) + 
  geom_point(alpha = 0.5)  + 
  geom_hline(yintercept=9339, color="blue") +
  geom_hline(yintercept=9074, color="coral") +
  scale_x_continuous(trans = "log10", labels = comma) + 
  scale_y_time() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  xlab("GDP for 2020 (billion USD)")+
  ylab("Finishing Time") +
  geom_label(aes(3,9340,label = "Upper Half" , vjust = -0.3), color="blue") +
  geom_label(aes(3,9075,label ="Top 10" , vjust = -0.3), color="coral" ) +
  ggtitle("Women's Marathon Results by GDP in 2020")

Graph 8.B Women’s Marathon Results by GDP in 2020

p_20fgdp

From the above two scatterplots, we can see that finishing times are more evenly distributed along the x-axis (GDP) in 2016.

At the Rio Olympics, 157 female athletes entered the marathon and 133 finished. Athletes who placed in the upper half completed the race within 9748 seconds (2 hrs 42 mins 28 secs); athletes who finished in the top 10 did so within 8916 seconds (2 hrs 28 mins 36 secs).

At the Tokyo Olympics, 88 female athletes entered the marathon and 73 finished. Athletes who placed in the upper half completed the race within 9339 seconds (2 hrs 35 mins 39 secs); athletes who finished in the top 10 did so within 9074 seconds (2 hrs 31 mins 14 secs).

As with the men, we used separate scatterplots for the two years. Comparing the 2016 and 2020 women's scatterplots, the 2016 distribution is more evenly spread along the x-axis. The GDPs of the countries athletes originated from were higher in 2020, both for the overall female marathon field and for those who placed in the upper half.

Graph 9. Men’s and Women’s Marathon Results by COVID-19 Cases Per Capita

# Graph 9. Men's and Women's Marathon Results by COVID-19 Cases Per Capita

p_20fcasepp <- dat_20f %>% filter(!is.na(continent)) %>% 
  ggplot(aes(case_pp, time, col = continent)) +
  geom_point(alpha = 0.5)  + 
  geom_hline(yintercept=9340, color="blue") + 
  geom_hline(yintercept=9075, color="coral") +
  scale_y_time() + 
  xlab("Women")+
  ylab("")+
  geom_label(aes(0.14,9340,label = "Upper Half" , vjust = -0.3), color="blue" , size = 3) +
  geom_label(aes(0.14,9075,label ="Top 10" , vjust = -0.3), color="coral" , size = 3) 
 
p_20mcasepp <- dat_20m %>% filter(!is.na(continent)) %>% 
  ggplot(aes(case_pp, time, col = continent)) +
  geom_point(alpha = 0.5)  + 
  geom_hline(yintercept=8240, color="blue") + 
  geom_hline(yintercept=7934, color="coral") +
  scale_y_time() + 
  xlab("Men")+
  ylab("") +
  geom_label(aes(0.14,8240,label = "Upper Half" , vjust = -0.3), color="blue", size = 3) +
  geom_label(aes(0.14,7934,label ="Top 10" , vjust = -0.3), color="coral", size = 3 ) 

arrangedplot <- ggarrange(p_20mcasepp, p_20fcasepp,common.legend = TRUE )
annotate_figure(arrangedplot,
               top = text_grob("Men's and Women's Marathon Results by COVID-19 Cases Per Capita", color = "black", size = 12),
               left = text_grob("Finishing Time", rot = 90),
               bottom = text_grob("COVID-19 Cases Per Capita", color = "black")) 

Here, the x-axis represents the total cumulative number of COVID-19 cases in countries of athlete origin divided by country population, from the beginning of the pandemic until July 22, 2021. The y-axis represents the marathon finishing time at the Tokyo 2020 games.

From these scatterplots we can see that, among athletes who placed in the upper half, the distribution is roughly M-shaped for both genders: upper-half finishers came from countries with both high and low COVID-19 cases per capita.

Among the women's top 10, more athletes came from countries with fewer COVID-19 cases per capita; we do not see this trend among the men. While viewing these scatterplots, we should keep in mind issues such as under-reporting of COVID-19 cases and variation in case-counting methods across countries.

Graph 10. Men’s and Women’s Upper Half Marathon Finishers by COVID-19 Cases Per Capita

# Graph 10 Men's and Women's Upper Half Marathon Finishers by COVID-19 Cases Per Capita
#Top half +severe 
top_half_severity_m20 <- dat_20m %>% filter(!is.na(time)) %>% filter(!is.na(case_pp)) %>% 
  mutate(top_half = ifelse(rank((time_sec))<=38, "Upper 50% Place","Lower 50% Place" )) %>% 
  mutate(severe = ifelse(case_pp <= 0.0472, "Case Count Below Median", "Case Count Above Median"))  

top_half_severity_f20 <- dat_20f %>% filter(!is.na(time) ) %>% 
  mutate(top_half= ifelse(rank((time_sec))<=37, "Upper 50% Place" , "Lower 50% Place")) %>% 
  mutate(severe= ifelse(case_pp <= 0.0472, "Case Count Below Median", "Case Count Above Median"))  

#Define the case-severity cutoff: 0.0472 is the median case_pp across the 506 athletes
top_half_severity <- dat %>% filter(rank(case_pp) <= 253) %>% arrange(case_pp)
#Use 0.0472 as the below/above-median cutoff for case severity
 
PP1 <- top_half_severity_m20 %>% 
  ggplot(aes(x = as.factor(top_half), fill= as.factor(severe))) +
  geom_bar(position = position_stack(), aes(group = as.factor(severe))) +
  geom_text(stat = "count", aes(label = ..count..), position = position_stack(vjust = 0.5), size = 2.5) +
  xlab("Men")+ 
  ylab("") 

PP1<- PP1 + guides(fill=guide_legend(title="COVID-19 Case Count Per Capita"))

PP2 <- top_half_severity_f20 %>% ggplot(aes(x = as.factor(top_half), fill= as.factor(severe))) +
  geom_bar(position = position_stack(), aes(group = as.factor(severe))) +
  geom_text(stat = "count", aes(label = ..count..), position = position_stack(vjust = 0.5), size = 2.5) +
  xlab("Women")+ 
  ylab("") 

PP2 <- PP2 + guides(fill=guide_legend(title="COVID-19 Case Count Per Capita"))

arrange1<- ggarrange(PP1, PP2,  common.legend = TRUE, legend = "bottom") 
annotate_figure(arrange1,
               top = text_grob("Men's and Women's Upper Half Marathon Finishers by COVID-19 Cases Per Capita", color = "black", size = 13),
               left = text_grob("Athlete Count", rot = 90)) 

Here, the x-axis distinguishes athletes who placed in the upper versus lower 50% of finishers, shown separately for men and women; the y-axis shows athlete counts. The colored bars give the number of athletes from countries below or above the median COVID-19 case count per capita. That median is taken, over all athletes in the dataset, of each country's cumulative COVID-19 cases from the beginning of the pandemic until July 22, 2021, divided by its population.

We can see that, in men, for both upper and lower half qualifiers, roughly equal numbers of athletes were from countries below and above the median cases per capita.

In women, however, more athletes who qualified in the upper half were from countries with a below-median case per capita.
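The 0.0472 cutoff used in the code above is simply the median of case_pp across all athletes; a minimal sketch with toy values (hypothetical numbers, not the real data) shows the more direct derivation with median():

```r
# Toy case_pp values (hypothetical, not the real data)
case_pp <- c(0.010, 0.030, 0.0472, 0.080, 0.120)

# The cutoff is just the sample median, so rank-based filtering is unnecessary
cutoff <- median(case_pp, na.rm = TRUE)
severe <- ifelse(case_pp <= cutoff, "Case Count Below Median", "Case Count Above Median")
```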

Graph 11. Countries of Top 20 Male Finishers on COVID-19 Heatmap

##Graph 11. Countries of Top 20 Male Finishers on COVID-19 Heatmap
#Top 20 codes 
top20_20m <- dat_20m %>% filter(rank(time_sec) <= 20, !is.na(continent)) %>% group_by(country) %>% arrange(country) %>% count() %>% as.data.frame()

top20_20m <-top20_20m  %>% mutate(
  lat = c(50.503887,35.861660,4.570868,15.339000, 46.6487132, 31.046051, 41.87194, 36.204824, -0.023559, 31.791702, 52.132633, 40.463667, -6.369028, 37.09024), 
  long=c(4.469936,104.195396,-74.297333, 38.937111, 2.6215658, 34.851612, 12.56738, 138.252924, 37.906193, -7.09262, 5.291266, -3.74922, 34.888822, -95.712891 ))

top20_16m <- dat_16m %>% filter(rank(time_sec) <= 20, !is.na(continent)) %>% group_by(country) %>% arrange(country) %>% count() %>% as.data.frame()

top20_16m <-top20_16m  %>% mutate(
  lat = c( -14.235004   ,56.130366 , 11.825138 , -1.831239 , 15.179384, 9.145   ,       55.378051, 36.204824, -0.023559, 52.132633,60.472024, 46.818188, -6.369028, 38.963745,  1.373333, 48.379433, 37.09024   ), 
  long=c( -51.92528 , -106.346771   , 42.590275 , -78.183406, 39.782334, 40.489673, -3.435973, 138.252924, 37.906193, 5.291266, 8.468946, 8.227512, 34.888822, 35.243322, 32.290275, 31.16558, -95.712891))

top20_20f <- dat_20f %>% filter(rank(time_sec) <= 20, !is.na(continent)) %>% group_by(country) %>% arrange(country) %>% count() %>% as.data.frame()

top20_20f <-top20_20f  %>% mutate(
  lat = c(-25.274398, 25.930414, 53.709807, 56.130366, 9.145, 51.165691, 36.204824, -0.023559   , -29.609988, -22.95764, 51.919438 , -30.559482 , 46.818188 , 1.373333 , 37.09024 ), 
  long= c( 133.775136, 50.637772, 27.953389, -106.346771, 40.489673,  10.451526, 138.252924, 37.906193 , 28.233608, 18.49041    , 19.145136 , 22.937506, 8.227512, 32.290275,-95.712891 ))

top20_16f <- dat_16f %>% filter(rank(time_sec) <= 20, !is.na(continent)) %>% group_by(country) %>% arrange(country) %>% count() %>% as.data.frame()

top20_16f <-top20_16f  %>% mutate(
  lat = c( -25.274398, 25.930414, 53.709807 ,  9.145 , 53.41291, 41.87194 , 36.204824, -0.023559    , 56.879635, 55.169438, 40.339852, -9.189967 , 39.399872    , 37.09024 ), 
  long= c(  133.775136, 50.637772 , 27.953389 ,40.489673, -8.24389  ,12.56738, 138.252924, 37.906193 , 24.603189, 23.881275, 127.510093, -75.015152 ,  -8.224454    , -95.712891))

library(zoo)
library(maps)
library(lubridate) # for mdy()
world_map = map_data("world")
jhu <- read_csv("time_series_covid19_confirmed_global.csv")
# Reshape to long format and convert the dates to date types
jhu_long <- jhu %>% gather(date, cases, `1/22/20`:`10/31/20`)
jhu_long <- jhu_long %>% mutate(date = mdy(date))
# Sum up the number of cases within each country and date
jhu_sum <- jhu_long %>% 
   select(`Country/Region`, date, cases)  %>% 
   group_by(`Country/Region`, date) %>% 
   summarize(total_cases = sum(cases, na.rm=TRUE), .groups = "drop") %>%
   ungroup()
# Calculate 7-day rolling average of new cases
# Add 7-day rolling average of new cases to data frame from question 5
jhu_sum <- jhu_sum %>% 
   group_by(`Country/Region`) %>%
   arrange(date) %>%
   mutate(cases_increase = total_cases - lag(total_cases)) %>% 
   ungroup() %>% 
   arrange(`Country/Region`)

jhu_sum <- jhu_sum %>% 
   group_by(`Country/Region`) %>%
   arrange(date) %>%
   mutate(cases_7rolling = rollmean(cases_increase, k = 7, fill = NA)) %>% 
   ungroup() %>% 
   arrange(`Country/Region`)
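As a sanity check on the rolling average (an illustrative aside, assuming nothing beyond base R): rollmean() with fill = NA is center-aligned by default, so the first and last three values of each country's 7-day series come out NA. The same centered 7-day mean can be reproduced with stats::filter() on a toy series:

```r
# Centered 7-day moving average of a toy series; only position 4 has a full window
ma7 <- as.numeric(stats::filter(1:7, rep(1/7, 7)))
sum(!is.na(ma7))  # one non-NA value (approximately 4) at the centre
```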
 
uid_lookup_table = read_csv("UID_ISO_FIPS_LookUp_Table.csv")
 
# Extract the country-level populations and use nice names
uid_pop <- uid_lookup_table %>% 
  subset(is.na(Province_State)) %>% 
  rename(country = "Country_Region", population = "Population") %>% 
  select(country, population)
# Join the country populations (uid_pop) to the Johns Hopkins data
jhu_sum <- left_join(jhu_sum, uid_pop, by= c("Country/Region" = "country"))
# Create a new cases per million variable (7-day average)
jhu_sum <- jhu_sum %>% 
   mutate(new_cases7_per_million = cases_7rolling /(population/1000000))
# Key for discrepant country names in Johns Hopkins and world map data
country_key = data.frame(rbind(c("Antigua and Barbuda", "Antigua"), 
                               c("Burma", "Myanmar"), 
                               c("Cabo Verde", "Cape Verde"), 
                               c("Congo (Kinshasa)", 
                                 "Democratic Republic of the Congo"), 
                               c("Congo (Brazzaville)", 
                                 "Republic of Congo"), 
                               c("Cote d'Ivoire", "Ivory Coast"), 
                               c("Czechia", "Czech Republic"), 
                               c("Eswatini", "Swaziland"), 
                               c("Holy See", "Vatican"), 
                               c("Korea, South", "South Korea"), 
                               c("North Macedonia", "Macedonia"), 
                               c("Saint Kitts and Nevis", "Saint Kitts"), 
                               c("Saint Vincent and the Grenadines", 
                                 "Saint Vincent"), 
                               c("Taiwan*", "Taiwan"), 
                               c("Trinidad and Tobago", "Trinidad"), 
                               c("United Kingdom", "UK"), 
                               c("US", "USA")))
names(country_key) = c("JHU", "map")

# Create named vector for recoding country names
recode_map <- country_key$JHU; names(recode_map) = country_key$map
# Recode country names in world map data to match with Johns Hopkins
world_map <- world_map %>%
  mutate(region = recode(region, !!!recode_map))
# Filter Johns Hopkins data for July 1, 2020 and join with world_map data frame.
# When joining, remember that the variable referring to countries has a different name in the JHU and world map data frames.
jhu_0701 <- jhu_sum %>%
   filter(date == "2020-07-01") 
jhu_0701 <- left_join(jhu_0701, world_map, by= c("Country/Region" = "region"))
# Heatmap of cases per million on Jul 1, 2020.
library(RColorBrewer)
mappie<- jhu_0701 %>% 
  ggplot() +
   geom_polygon(color = "black", aes(x = long, y = lat, group = group, fill=new_cases7_per_million)) +  coord_fixed(1.3)  +
   theme(panel.grid.major = element_blank(), 
         panel.background = element_blank(),
         axis.title = element_blank(), 
         axis.text = element_blank(),
         axis.ticks = element_blank()) +
   scale_fill_gradientn(colors = brewer.pal(8, "Oranges"), trans = "sqrt")  +  
   theme(plot.title = element_text(hjust = 0.5)) +
   labs(fill="Cases per million")
mappie_1<-mappie + 
   ggtitle("Countries of Top 20 Male Finishers on COVID-19 Heatmap") +
  geom_point(data= top20_20m, aes(x=long, y=lat, size=n , color='2020'), alpha=0.5) + 
  geom_point(data= top20_16m, aes(x=long, y=lat, size=n, color='2016'), alpha=0.7) +
  scale_color_manual(name='Year',
                     breaks=c('2020', '2016'),
                     values=c('2020'='blue', '2016'='green'))

mappie_2<-mappie + 
   ggtitle("Countries of Top 20 Female Finishers on COVID-19 Heatmap")+ 
  geom_point(data= top20_20f, aes(x=long, y=lat, size=n , color='2020'), alpha=0.5) + 
  geom_point(data= top20_16f, aes(x=long, y=lat, size=n, color='2016'), alpha=0.7) +
  scale_color_manual(name='Year',
                     breaks=c('2020', '2016'),
                     values=c('2020'='blue', '2016'='green'))

mappie_1

On this COVID-19 heatmap, each country is shaded orange, graded from light to dark, to reflect its COVID-19 severity at the beginning of July 2020, measured as the 7-day rolling average of new cases per million on July 1, 2020. The green dots mark the countries of the top 20 male finishers at the 2016 games and the blue dots those at the 2020 games; the size of each dot reflects the number of athletes from that country who finished in the top 20. This graph lets us see how the top 20 finishers and their countries changed between 2016 and 2020 against the backdrop of COVID-19 severity.

Graph 12. Countries of Top 20 Female Finishers on COVID-19 Heatmap

## Graph 12. Countries of Top 20 Female Finishers on COVID-19 Heatmap

mappie_2  

On this COVID-19 heatmap, each country is again shaded orange, graded from light to dark, to reflect its COVID-19 severity at the beginning of July 2020, measured as the 7-day rolling average of new cases per million on July 1, 2020. The green dots mark the countries of the top 20 female finishers at the 2016 games and the blue dots those at the 2020 games; the size of each dot reflects the number of athletes from that country who finished in the top 20. This graph lets us see how the top 20 finishers and their countries changed between 2016 and 2020 against the backdrop of COVID-19 severity.

3. Shiny App Analysis (Yuki Zhong)

# Libraries
library(tidyverse)
library(shiny)
library(ggthemes)
library(shinythemes)

# Read in dataframe
dat <- read_csv("data.csv")
## Rows: 506 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): rank, athlete, country, time, sex, olympic, continent
## dbl  (12): time_sec, sb, dnf, age, lockdown, prior_attend, case_pp, gdp2016,...
## date  (1): dob
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
shinyApp(ui = fluidPage( 
    # Change theme 
    theme = shinytheme("superhero"),
   
    #name title Panel
    titlePanel("Difference in marathon time between the 2016 and 2020 Olympics for each rank"),
    
    br(),
    
    # Sidebar
    sidebarLayout(
        
        # Widgets for selection
        sidebarPanel(
            # Explanatory text
            p("This Shiny app allows you to use a slider to see the difference in record for each rank between the 2016 and 2020 Olympics, separately for men and women."),
            
            br(),
            
            # Radio buttons that allows the user to choose a gender
            radioButtons(inputId = "sex", label = "Select men or women",
                         choices = c("Men", "Women")),
            
            # Input: rank slider with basic animation
            sliderInput("rank", "rank",
                        min = 1, max = 73,
                        value = 1, 
                        step = 1,
                        ticks = TRUE,  
                        animate = animationOptions(interval = 800) # add play speed
        )),
        
        # Main panel
        mainPanel(
          
            # Plot
            plotOutput("plot"),
            br(), 
            # Message about the difference in seconds for each rank
            textOutput("diff")
        )
    )
),

# Define server logic
server = function(input, output){
    
    # Make the selected line for the selected gender
    output$plot = renderPlot({
      
        # create the dataset for the plot
        i <- 1
        n <- input$rank+1
        y_diff<- c() #define an empty list that will be the y-value for the plot
        x_rank<- c() #define an empty list that will be the x-value for the plot
      
        #loop over ranks 1..input$rank, accumulating each rank's Rio-to-Tokyo difference and its rank
        while (i< n) {
          dat_diff <- dat %>% filter(sex == input$sex & rank == i) 
          y_diff <- c(y_diff,  diff(dat_diff$time_sec))
          x_rank <- c(x_rank, i)
          i <- i+1
        }
        
        # Scatterplot for the difference in seconds vs. rank
        ggplot() +
        theme_economist() +
        geom_line(aes(x = x_rank, y=y_diff), color = ifelse(input$sex == "Men", 'deepskyblue2','hotpink2'), size = 1) +
        xlab("rank") +
        ylab("Difference (seconds Rio - Tokyo)") +
        geom_hline(yintercept = 0, colour = 'black') 
        
            
    })
    
    # Identify which Olympics had a faster record by how many seconds, for each rank and gender
    output$diff = renderText({
        dat_diff <- dat %>% filter(sex == input$sex & rank == input$rank) 
        paste0("For rank ", input$rank, " in the ", input$sex, "'s marathon, ", ifelse(diff(dat_diff$time_sec)<0,  "the record in Tokyo 2020", "the record in Rio 2016"), " is ", abs(diff(dat_diff$time_sec)), " seconds faster")
    })
})

Interpretation: From the interactive graph, we can see that for both men and women, the Tokyo 2020 Olympics had faster records overall. For men, the Rio records are faster among the top ranks (1-16) and the middle ranks (30-40). For women, the Rio records are faster only sporadically between ranks 30 and 50. In addition, the time differences are more pronounced at the highest and lowest ranks, while the differences in the middle ranks are smaller.
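The per-rank difference computed in the server's while loop can also be obtained without a loop. A minimal base-R sketch with toy times (hypothetical values, not the real records):

```r
# Toy stand-in for dat: ranks 1-3 at each Olympics, times in seconds (hypothetical)
toy <- data.frame(
  olympic  = rep(c("Rio2016", "Tokyo2020"), each = 3),
  rank     = rep(1:3, 2),
  time_sec = c(7700, 7800, 7900, 7650, 7820, 7850)
)

rio   <- toy[toy$olympic == "Rio2016", ]
tokyo <- toy[toy$olympic == "Tokyo2020", ]

# Tokyo minus Rio for each rank; negative values mean Tokyo was faster
diff_sec <- tokyo$time_sec[order(tokyo$rank)] - rio$time_sec[order(rio$rank)]
```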

4. Regression (Satoko Ugai)

#This study examined whether there was a difference in marathon records between the pre- and post-COVID-19 Olympics. A total of 506 athletes participating in the Rio 2016 and Tokyo 2020 Olympics were included. The primary outcome was the marathon record. The predictor was the Olympics: Rio 2016 (i.e., pre-COVID-19) versus Tokyo 2020 (i.e., post-COVID-19).

# Libraries
library(tidyverse)

# Read in the dataframe
dat <- read_csv("data.csv")
## Rows: 506 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (7): rank, athlete, country, time, sex, olympic, continent
## dbl  (12): time_sec, sb, dnf, age, lockdown, prior_attend, case_pp, gdp2016,...
## date  (1): dob
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Making table1
library(gtsummary)
labels <-  list(time_sec ~ "Marathon record,mean sec (SD)", 
                sb ~ "Season best, n (%)",
                dnf ~ "Did not finish, n (%)",
                age ~ "Age, mean year(SD)",
                sex ~ "Sex,n (%)",
                continent ~ "Continent,n(%)",
                lockdown ~ "Number of athletes from locked down countries")
#making table1 overall
tbl1<- tbl_summary(data=dat %>% 
                   
                   dplyr::select(olympic,time_sec,sb,dnf,age,sex,continent,lockdown),
                   by=olympic,
                   label=labels,
                   statistic=list(c(time_sec,age)~"{mean}({sd})"),
                   missing="no") %>%
  add_p(test=list(c(time_sec,age)~"t.test",
                  all_categorical()~"chisq.test.no.correct")) %>%
                  modify_caption("**Table 1. Demographic Characteristics of All Athletes**") 
tbl1
Table 1. Demographic Characteristics of All Athletes
Characteristic Rio2016, N = 312¹ Tokyo2020, N = 194¹ p-value²
Marathon record,mean sec (SD) 9,165(870) 8,887(710) <0.001
Season best, n (%) 24 (8.1%) 98 (66%) <0.001
Did not finish, n (%) 39 (12%) 45 (23%) 0.002
Age, mean year(SD) 30.7(4.9) 31.0(4.4) 0.4
Sex,n (%) 0.3
Men 155 (50%) 106 (55%)
Women 157 (50%) 88 (45%)
Continent,n(%) 0.034
Africa 61 (20%) 44 (23%)
Asia 74 (24%) 34 (18%)
Europe 106 (34%) 71 (37%)
North America 20 (6.5%) 20 (10%)
Oceania 6 (1.9%) 9 (4.7%)
South America 43 (14%) 15 (7.8%)
Number of athletes from locked down countries 290 (93%) 179 (92%) 0.8

¹ Mean(SD); n (%)

² Welch Two Sample t-test; Pearson's Chi-squared test

#Table 1 summarizes the demographic characteristics of all athletes at the Rio 2016 and Tokyo 2020 Olympics. The two groups were compared with t tests for continuous variables and chi-squared tests for categorical variables. There were 312 athletes at the Rio Olympics and 194 at the Tokyo Olympics. The mean marathon record was 9165 seconds (2 h 32 min 45 s) at the Rio Olympics and 8887 seconds (2 h 28 min 7 s) at the Tokyo Olympics, and the t test gave a p-value of <0.001. We therefore conclude that there is a statistically significant difference in marathon records between the two Olympics. The proportions of athletes who achieved a season's best and of those who dropped out of the race were higher at the Tokyo Olympics than at the Rio Olympics. By continent, the percentages of athletes from Asia and South America were lower at the Tokyo Olympics than at the Rio Olympics.
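The p-value in Table 1 comes from gtsummary's add_p(); the same Welch two-sample t-test is a one-liner in base R. A sketch with simulated records (group sizes, means, and SDs borrowed from Table 1; the data are simulated, not the real records):

```r
set.seed(2016)
# Simulated marathon records (seconds) using Table 1's group sizes, means, and SDs
toy <- data.frame(
  olympic  = rep(c("Rio2016", "Tokyo2020"), times = c(312, 194)),
  time_sec = c(rnorm(312, mean = 9165, sd = 870), rnorm(194, mean = 8887, sd = 710))
)
res <- t.test(time_sec ~ olympic, data = toy)  # Welch test: var.equal = FALSE by default
```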

#making table2 men
labels1 <-  list(time_sec ~ "Marathon record,mean sec (SD)", 
                sb ~ "Season best, n (%)",
                dnf ~ "Did not finish, n (%)",
                age ~ "Age, mean year(SD)",
                continent ~ "Continent,n (%)",
                lockdown ~ "Number of athletes from locked down countries")
tbl2<- tbl_summary(data=dat %>% 
                   filter(sex=="Men") %>%
                   dplyr::select(olympic,time_sec,sb,dnf,age,continent,lockdown),
                   by=olympic,
                   label=labels1,
                   statistic=list(c(time_sec,age)~"{mean}({sd})"),
                   missing="no") %>%
  add_p(test=list(c(time_sec,age)~"t.test",
                  all_categorical()~"chisq.test.no.correct")) %>%
                  modify_caption("**Table 2. Demographic Characteristics of Male Athletes**") 
## Warning for variable 'continent':
##  simpleWarning in stats::chisq.test(x = c("Africa", "Europe", "Europe", "Africa", : Chi-squared approximation may be incorrect
tbl2
Table 2. Demographic Characteristics of Male Athletes
Characteristic Rio2016, N = 155¹ Tokyo2020, N = 106¹ p-value²
Marathon record,mean sec (SD) 8,543(465) 8,324(366) <0.001
Season best, n (%) 12 (8.6%) 50 (66%) <0.001
Did not finish, n (%) 16 (10%) 30 (28%) <0.001
Age, mean year(SD) 29.9(4.8) 31.1(4.4) 0.048
Continent,n (%) 0.3
Africa 39 (25%) 27 (26%)
Asia 34 (22%) 18 (17%)
Europe 44 (29%) 35 (33%)
North America 11 (7.1%) 11 (10%)
Oceania 3 (1.9%) 5 (4.8%)
South America 23 (15%) 9 (8.6%)
Number of athletes from locked down countries 141 (91%) 98 (92%) 0.7

¹ Mean(SD); n (%)

² Welch Two Sample t-test; Pearson's Chi-squared test

#Table 2 summarizes the demographic characteristics of male athletes at the Rio 2016 and Tokyo 2020 Olympics. The mean marathon record was 8543 seconds (2 h 22 min 23 s) at the Rio Olympics and 8324 seconds (2 h 18 min 44 s) at the Tokyo Olympics, and the t test gave a p-value of <0.001. We therefore conclude that there is a statistically significant difference in marathon records between the two Olympics among male athletes. The percentages of athletes who achieved a season's best and of those who dropped out of the race were significantly higher at the Tokyo Olympics than at the Rio Olympics. As for overall athletes, the percentages of male athletes from Asia and South America were lower at the Tokyo Olympics than at the Rio Olympics.

#making table3 women
tbl3<- tbl_summary(data=dat %>% 
                   filter(sex=="Women") %>%
                   dplyr::select(olympic,time_sec,sb,dnf,age,continent,lockdown),
                   by=olympic,
                   label=labels1,
                   statistic=list(c(time_sec,age)~"{mean}({sd})"),
                   missing="no") %>%
  add_p(test=list(c(time_sec,age)~"t.test",
                  all_categorical()~"chisq.test.no.correct")) %>%
                  modify_caption("**Table 3. Demographic Characteristics of Female Athletes**") 
## Warning for variable 'continent':
##  simpleWarning in stats::chisq.test(x = c("Africa", "Africa", "North America", : Chi-squared approximation may be incorrect
tbl3
Table 3. Demographic Characteristics of Female Athletes
Characteristic Rio2016, N = 157¹ Tokyo2020, N = 88¹ p-value²
Marathon record,mean sec (SD) 9,814(704) 9,473(462) <0.001
Season best, n (%) 12 (7.6%) 48 (66%) <0.001
Did not finish, n (%) 23 (15%) 15 (17%) 0.6
Age, mean year(SD) 31.5(4.9) 31.0(4.4) 0.5
Continent,n (%) 0.2
Africa 22 (14%) 17 (19%)
Asia 40 (26%) 16 (18%)
Europe 62 (40%) 36 (41%)
North America 9 (5.8%) 9 (10%)
Oceania 3 (1.9%) 4 (4.5%)
South America 20 (13%) 6 (6.8%)
Number of athletes from locked down countries 149 (95%) 81 (92%) 0.4

¹ Mean(SD); n (%)

² Welch Two Sample t-test; Pearson's Chi-squared test

#Table 3 summarizes the demographic characteristics of female athletes at the Rio 2016 and Tokyo 2020 Olympics. The mean marathon record was 9814 seconds (2 h 43 min 34 s) at the Rio Olympics and 9473 seconds (2 h 37 min 53 s) at the Tokyo Olympics, and the t test gave a p-value of <0.001. We conclude that there is a statistically significant difference in marathon records between the two Olympics among female athletes. As with the male athletes, the percentage of athletes who achieved a season's best was higher at the Tokyo Olympics, while there was no statistically significant difference in the percentage who dropped out of the race. As for overall and male athletes, the percentages of female athletes from Asia and South America were lower at the Tokyo Olympics than at the Rio Olympics.


#Multivariable-adjusted linear regression

#male
dat1 <- dat %>% 
     filter(sex=="Men")
dat1$continent <- as.factor(dat1$continent)

model3<-lm(time_sec ~  olympic+sb+age+continent,data=dat1)
summary(model3)
## 
## Call:
## lm(formula = time_sec ~ olympic + sb + age + continent, data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -740.68 -303.64  -59.32  262.07 1447.96 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8301.079    197.945  41.936  < 2e-16 ***
## olympicTokyo2020       -124.424     77.626  -1.603 0.110513    
## sb                     -139.439     82.570  -1.689 0.092797 .  
## age                       2.218      6.576   0.337 0.736302    
## continentAsia           308.430     89.030   3.464 0.000648 ***
## continentEurope         153.620     83.159   1.847 0.066151 .  
## continentNorth America  308.771    113.263   2.726 0.006965 ** 
## continentOceania        149.724    173.963   0.861 0.390433    
## continentSouth America  262.540    101.671   2.582 0.010516 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 422.5 on 204 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.1329, Adjusted R-squared:  0.0989 
## F-statistic: 3.909 on 8 and 204 DF,  p-value: 0.0002576
model3 %>% 
  tbl_regression(intercept = TRUE)%>%
as_gt() %>%
  gt::tab_header(title = "Table 4. Multivariable-adjusted linear regression",
                 subtitle = "Men")
Table 4. Multivariable-adjusted linear regression
Men
Characteristic Beta 95% CI¹ p-value
(Intercept) 8,301 7,911, 8,691 <0.001
olympic
Rio2016
Tokyo2020 -124 -277, 29 0.11
sb -139 -302, 23 0.093
age 2.2 -11, 15 0.7
continent
Africa
Asia 308 133, 484 <0.001
Europe 154 -10, 318 0.066
North America 309 85, 532 0.007
Oceania 150 -193, 493 0.4
South America 263 62, 463 0.011

¹ CI = Confidence Interval

#Female
dat2=dat %>% 
     filter(sex=="Women")
model4<-lm(time_sec ~  olympic+sb+age+continent,data=dat2)
summary(model4)
## 
## Call:
## lm(formula = time_sec ~ olympic + sb + age + continent, data = dat2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1358.0  -375.2   -92.9   253.0  1953.8 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            9546.732    326.980  29.197   <2e-16 ***
## olympicTokyo2020       -180.190    111.715  -1.613   0.1084    
## sb                     -193.446    119.443  -1.620   0.1069    
## age                       4.597      9.750   0.471   0.6378    
## continentAsia           317.187    147.236   2.154   0.0324 *  
## continentEurope          25.848    134.975   0.192   0.8483    
## continentNorth America  -96.733    193.853  -0.499   0.6183    
## continentOceania        -71.718    261.220  -0.275   0.7840    
## continentSouth America  395.574    174.985   2.261   0.0249 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 612.3 on 196 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.1482, Adjusted R-squared:  0.1134 
## F-statistic: 4.261 on 8 and 196 DF,  p-value: 9.703e-05
model4 %>% 
  tbl_regression(intercept = TRUE)%>%
as_gt() %>%
  gt::tab_header(title = "Table 5. Multivariable-adjusted linear regression",
                 subtitle = "Women")
Table 5. Multivariable-adjusted linear regression
Women
Characteristic Beta 95% CI1 p-value
(Intercept) 9,547 8,902, 10,192 <0.001
olympic
Rio2016
Tokyo2020 -180 -401, 40 0.11
sb -193 -429, 42 0.11
age 4.6 -15, 24 0.6
continent
Africa
Asia 317 27, 608 0.032
Europe 26 -240, 292 0.8
North America -97 -479, 286 0.6
Oceania -72 -587, 443 0.8
South America 396 50, 741 0.025

1 CI = Confidence Interval

#Making figure1
#Marathon records of male athletes from Asian countries

p3<-dat %>% filter(sex=="Men" & continent=="Asia") %>%
  ggplot(aes(x=factor(olympic), y=time)) +
  geom_boxplot() +
  scale_y_time() +
  xlab("Olympic") +
  ylab("Records") +
  ggtitle("Figure 1. Marathon records of male athletes from Asian countries")
p3

#Making figure2
#Marathon records of female athletes from Asian countries

p4<-dat %>% filter(sex=="Women" & continent=="Asia") %>%
  ggplot(aes(x=factor(olympic), y=time)) +
  geom_boxplot() +
  scale_y_time() +
  xlab("Olympic") +
  ylab("Records") +
  ggtitle("Figure 2. Marathon records of female athletes from Asian countries")
p4

#After adjusting for season-best times, age, and continent in the multivariable linear regression models, there is no statistically significant difference in marathon times between the Rio and Tokyo Olympics for either men or women (Tables 4 and 5). Fewer Asian athletes participated in the Tokyo Olympics than in the Rio Olympics, and the Asian athletes who did participate recorded faster times in Tokyo than in Rio (Figures 1 and 2). For example, North Korea withdrew as a country in order to protect its athletes from COVID-19 infection. The three North Korean women who did not run held marathon world rankings of 80th, 123rd, and 158th, and the North Korean man who did not run was ranked 168th. Because these rankings were relatively low compared with those of the athletes who did compete in Tokyo, selection bias among the Tokyo Olympic participants is possible in this study.

#Extend the model to evaluate whether there is evidence that the association between Olympics and marathon records differs between countries with and without a lockdown policy.

#Male

model5<-lm(time_sec ~  olympic+sb+age+continent+olympic*lockdown,data=dat1)
summary(model5)
## 
## Call:
## lm(formula = time_sec ~ olympic + sb + age + continent + olympic * 
##     lockdown, data = dat1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -738.47 -300.67  -57.85  249.11 1443.14 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8384.002    223.080  37.583  < 2e-16 ***
## olympicTokyo2020          -210.566    220.907  -0.953  0.34163    
## sb                        -141.330     83.587  -1.691  0.09242 .  
## age                          2.533      6.613   0.383  0.70208    
## continentAsia              303.069     90.271   3.357  0.00094 ***
## continentEurope            162.227     84.391   1.922  0.05597 .  
## continentNorth America     316.314    114.555   2.761  0.00629 ** 
## continentOceania           156.677    175.165   0.894  0.37214    
## continentSouth America     261.926    102.168   2.564  0.01108 *  
## lockdown                  -104.033    127.067  -0.819  0.41391    
## olympicTokyo2020:lockdown   94.746    220.999   0.429  0.66858    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 423.9 on 202 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.1358, Adjusted R-squared:  0.09299 
## F-statistic: 3.174 on 10 and 202 DF,  p-value: 0.0008407
model5 %>% 
  tbl_regression(intercept = TRUE)%>%
as_gt() %>%
  gt::tab_header(title = "Table 6. Multivariable-adjusted linear regression (interaction by lockdown policy)",
                 subtitle = "Men")
Table 6. Multivariable-adjusted linear regression (interaction by lockdown policy)
Men
Characteristic Beta 95% CI1 p-value
(Intercept) 8,384 7,944, 8,824 <0.001
olympic
Rio2016
Tokyo2020 -211 -646, 225 0.3
sb -141 -306, 23 0.092
age 2.5 -11, 16 0.7
continent
Africa
Asia 303 125, 481 <0.001
Europe 162 -4.2, 329 0.056
North America 316 90, 542 0.006
Oceania 157 -189, 502 0.4
South America 262 60, 463 0.011
lockdown -104 -355, 147 0.4
olympic * lockdown
Tokyo2020 * lockdown 95 -341, 531 0.7

1 CI = Confidence Interval

#Female

model6<-lm(time_sec ~  olympic+sb+age+continent+olympic*lockdown,data=dat2)
summary(model6)
## 
## Call:
## lm(formula = time_sec ~ olympic + sb + age + continent + olympic * 
##     lockdown, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1389.52  -405.53   -99.26   254.52  1936.37 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               9514.198    392.095  24.265   <2e-16 ***
## olympicTokyo2020          -449.356    328.825  -1.367   0.1733    
## sb                        -180.890    119.751  -1.511   0.1325    
## age                          3.426      9.781   0.350   0.7265    
## continentAsia              350.767    150.165   2.336   0.0205 *  
## continentEurope             18.300    135.293   0.135   0.8925    
## continentNorth America    -111.527    194.178  -0.574   0.5664    
## continentOceania           -90.294    261.501  -0.345   0.7302    
## continentSouth America     389.659    175.253   2.223   0.0273 *  
## lockdown                    67.913    234.012   0.290   0.7720    
## olympicTokyo2020:lockdown  296.560    331.623   0.894   0.3723    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 612.1 on 194 degrees of freedom
##   (40 observations deleted due to missingness)
## Multiple R-squared:  0.1573, Adjusted R-squared:  0.1139 
## F-statistic: 3.621 on 10 and 194 DF,  p-value: 0.0001939
model6 %>% 
  tbl_regression(intercept = TRUE)%>%
as_gt() %>%
  gt::tab_header(title = "Table 7. Multivariable-adjusted linear regression (interaction by lockdown policy)",
                 subtitle = "Women")
Table 7. Multivariable-adjusted linear regression (interaction by lockdown policy)
Women
Characteristic Beta 95% CI1 p-value
(Intercept) 9,514 8,741, 10,288 <0.001
olympic
Rio2016
Tokyo2020 -449 -1,098, 199 0.2
sb -181 -417, 55 0.13
age 3.4 -16, 23 0.7
continent
Africa
Asia 351 55, 647 0.021
Europe 18 -249, 285 0.9
North America -112 -494, 271 0.6
Oceania -90 -606, 425 0.7
South America 390 44, 735 0.027
lockdown 68 -394, 529 0.8
olympic * lockdown
Tokyo2020 * lockdown 297 -357, 951 0.4

1 CI = Confidence Interval

#We evaluated whether there is evidence that the association between Olympics and marathon records differs between countries with and without a lockdown policy. In the multivariable linear regression, the p-value for the interaction term was 0.7 for men and 0.4 for women. We therefore fail to reject the null hypothesis at the 0.05 significance level and conclude that the association between marathon records and pre-/post-COVID-19 Olympics does not differ significantly by lockdown policy (Tables 6 and 7). However, few countries lacked a lockdown policy, so we may not have had enough power to detect a statistical interaction.
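#The interaction p-value reported above comes from the t-statistic on the product term; on simulated toy data (all names and numbers below are illustrative, not the project data), the same question can also be posed as a partial F-test between nested models:

```r
# Toy sketch: testing an olympic-by-lockdown interaction by comparing
# nested linear models (the anova F-test is equivalent to the Wald test
# on the single product term here).
set.seed(1)
n <- 200
olympic  <- factor(rep(c("Rio2016", "Tokyo2020"), each = n / 2))
lockdown <- rbinom(n, 1, 0.8)                        # hypothetical policy flag
time_sec <- 8300 - 120 * (olympic == "Tokyo2020") + rnorm(n, sd = 400)

fit_main <- lm(time_sec ~ olympic + lockdown)        # no interaction
fit_int  <- lm(time_sec ~ olympic * lockdown)        # adds olympic:lockdown
anova(fit_main, fit_int)                             # F-test for the interaction
p_int <- summary(fit_int)$coefficients["olympicTokyo2020:lockdown", "Pr(>|t|)"]
```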

#Extend the model to evaluate whether total COVID-19 cases per population of each country is a significant predictor of marathon records.

#Men
dat3<-dat %>% 
  filter(sex %in% "Men"|olympic%in%"Tokyo2020")
model7<-lm(time_sec ~  olympic+case_pp+sb+age+continent+olympic*lockdown,data=dat3)
summary(model7)
## 
## Call:
## lm(formula = time_sec ~ olympic + case_pp + sb + age + continent + 
##     olympic * lockdown, data = dat3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1084.94  -456.92   -64.25   396.86  2040.20 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                8457.828    291.377  29.027  < 2e-16 ***
## olympicTokyo2020            207.014    247.508   0.836 0.403664    
## case_pp                   -1361.946   1217.529  -1.119 0.264288    
## sb                          -80.738     94.464  -0.855 0.393470    
## age                          -1.690      8.298  -0.204 0.838769    
## continentAsia               400.089    118.541   3.375 0.000845 ***
## continentEurope             362.228    123.162   2.941 0.003551 ** 
## continentNorth America      349.691    146.272   2.391 0.017495 *  
## continentOceania            270.789    204.337   1.325 0.186210    
## continentSouth America      405.000    148.491   2.727 0.006797 ** 
## lockdown                    -94.404    179.015  -0.527 0.598380    
## olympicTokyo2020:lockdown   218.884    249.973   0.876 0.382000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 598.3 on 273 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.1282, Adjusted R-squared:  0.0931 
## F-statistic: 3.651 on 11 and 273 DF,  p-value: 7.609e-05
model7 %>% 
  tbl_regression(intercept = TRUE)%>%
as_gt() %>%
  gt::tab_header(title = "Table 8. Multivariable-adjusted linear regression",
                 subtitle = "Men in Tokyo2020")
Table 8. Multivariable-adjusted linear regression
Men in Tokyo2020
Characteristic Beta 95% CI1 p-value
(Intercept) 8,458 7,884, 9,031 <0.001
olympic
Rio2016
Tokyo2020 207 -280, 694 0.4
case_pp -1,362 -3,759, 1,035 0.3
sb -81 -267, 105 0.4
age -1.7 -18, 15 0.8
continent
Africa
Asia 400 167, 633 <0.001
Europe 362 120, 605 0.004
North America 350 62, 638 0.017
Oceania 271 -131, 673 0.2
South America 405 113, 697 0.007
lockdown -94 -447, 258 0.6
olympic * lockdown
Tokyo2020 * lockdown 219 -273, 711 0.4

1 CI = Confidence Interval

#Women
dat4<-dat %>% 
  filter(sex %in% "Women"|olympic%in%"Tokyo2020")
model8<-lm(time_sec ~  olympic+case_pp+sb+age+continent+olympic*lockdown,data=dat4)
summary(model8)
## 
## Call:
## lm(formula = time_sec ~ olympic + case_pp + sb + age + continent + 
##     olympic * lockdown, data = dat4)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1279.58  -541.51   -73.87   451.76  2122.91 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                9410.103    402.519  23.378  < 2e-16 ***
## olympicTokyo2020           -932.300    322.767  -2.888 0.004194 ** 
## case_pp                   -3766.681   1357.706  -2.774 0.005929 ** 
## sb                          -50.271    112.378  -0.447 0.655000    
## age                           1.229      9.792   0.125 0.900235    
## continentAsia               557.489    145.716   3.826 0.000163 ***
## continentEurope             393.443    145.295   2.708 0.007215 ** 
## continentNorth America      256.979    179.797   1.429 0.154112    
## continentOceania             39.601    238.624   0.166 0.868319    
## continentSouth America      651.046    180.115   3.615 0.000360 ***
## lockdown                    194.762    267.853   0.727 0.467797    
## olympicTokyo2020:lockdown   112.134    324.107   0.346 0.729635    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 691.6 on 263 degrees of freedom
##   (76 observations deleted due to missingness)
## Multiple R-squared:  0.3646, Adjusted R-squared:  0.338 
## F-statistic: 13.72 on 11 and 263 DF,  p-value: < 2.2e-16
model8 %>% 
  tbl_regression(intercept = TRUE)%>%
as_gt() %>%
  gt::tab_header(title = "Table 9. Multivariable-adjusted linear regression",
                 subtitle = "Women in Tokyo2020")
Table 9. Multivariable-adjusted linear regression
Women in Tokyo2020
Characteristic Beta 95% CI1 p-value
(Intercept) 9,410 8,618, 10,203 <0.001
olympic
Rio2016
Tokyo2020 -932 -1,568, -297 0.004
case_pp -3,767 -6,440, -1,093 0.006
sb -50 -272, 171 0.7
age 1.2 -18, 21 >0.9
continent
Africa
Asia 557 271, 844 <0.001
Europe 393 107, 680 0.007
North America 257 -97, 611 0.2
Oceania 40 -430, 509 0.9
South America 651 296, 1,006 <0.001
lockdown 195 -333, 722 0.5
olympic * lockdown
Tokyo2020 * lockdown 112 -526, 750 0.7

1 CI = Confidence Interval

#We evaluated whether total COVID-19 cases per population of each country is a significant predictor of marathon records. The p-value for case_pp was 0.3 among male athletes and 0.006 among female athletes (Tables 8 and 9). Total COVID-19 cases per population is therefore a significant predictor of marathon records among female athletes. Notably, the more severe a country's outbreak was, the better the women's marathon records tended to be. The female athletes who reached the Tokyo Olympics may therefore have been physically capable enough to achieve better records even though they were affected by the pandemic.

5. Machine Learning (Yi-Ting Tsai and Mariko Ando)

Classification

Aside from looking at the difference between the Rio 2016 and Tokyo 2020 Olympics, we are also interested in how COVID-19 cases per population and other covariates can help us “predict” marathon records. Since COVID-19 data are available only for the Tokyo 2020 Olympics, this part of the analysis focuses solely on Tokyo 2020 data. Rather than predicting the exact finishing time, we simplify the task to classification. The first step is to convert the continuous marathon record into a binary variable: we choose a cutpoint, classify records above it into a “slower” group, and classify records below it into a “faster” group.
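As a minimal sketch of that binarization step (the times below are made up, and the cutpoint value is illustrative at this point):

```r
# Classify records at or above the cutpoint as "slower", below it as "faster".
cut <- 8239                                    # illustrative cutpoint in seconds
time_sec <- c(7900, 8100, 8239, 8400, 9000)    # made-up marathon times
group <- ifelse(time_sec >= cut, "slower", "faster")
group   # "faster" "faster" "slower" "slower" "slower"
```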

Cutpoint Analysis for Classification (Yi-Ting Tsai)

Since we are splitting the records into faster and slower groups, and men are usually faster, combining men’s and women’s data would cause problems in this analysis: the faster group would consist almost entirely of men and the slower group almost entirely of women. To avoid this issue, we conduct the classification analysis separately for men and women.

We use the training data to determine the optimal cutpoint with the R package cutpointr. We first create a label that assigns the top-half-ranking athletes the value 1 and the bottom-half-ranking athletes the value 0, and use this label to calculate sensitivity and specificity at each candidate cutpoint. We then choose the cutpoint that maximizes the sum of sensitivity and specificity.
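The quantity cutpointr maximizes can be sketched in base R on simulated data (all values below are illustrative): for each candidate cutpoint, classify times at or below it as the top-half class and keep the cutpoint with the largest sensitivity + specificity.

```r
set.seed(1)
time_sec <- rnorm(60, mean = 8300, sd = 300)              # simulated finishing times
rank_    <- order(order(time_sec + rnorm(60, sd = 150)))  # noisy ranking of the field
label    <- as.integer(rank_ <= 30)                       # top half of the ranking = 1

cand   <- sort(time_sec)                                  # candidate cutpoints
sum_ss <- sapply(cand, function(cp) {
  pred <- as.integer(time_sec <= cp)                      # direction "<="
  sens <- sum(pred == 1 & label == 1) / sum(label == 1)
  spec <- sum(pred == 0 & label == 0) / sum(label == 0)
  sens + spec
})
best_cut <- cand[which.max(sum_ss)]                       # cutpoint maximizing sens + spec
```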

Optimal Cutpoint for Men

datsecond <- dat %>% 
  filter(dnf==0 & olympic=="Tokyo2020" & sex=="Men") %>%
  dplyr::select(rank, time_sec, case_pp, continent, age, gdp2020, prior_attend) 

datsecond <- datsecond %>% 
  mutate(rank_b = ifelse(as.numeric(rank)<=38, 1, 0))

set.seed(1)
train.rows <- sample(rownames(datsecond), dim(datsecond)[1]*0.7)
test.rows <- setdiff(rownames(datsecond), train.rows)
train_set <- datsecond[train.rows, ]
test_set <- datsecond[test.rows, ]

opt_cut <- cutpointr(train_set, time_sec, rank_b, pos_class = 1, direction = "<=")  #8239
plot_metric(opt_cut)

Optimal Cutpoint for Women

datsecond <- dat %>% 
  filter(dnf==0 & olympic=="Tokyo2020" & sex=="Women") %>%
  dplyr::select(rank, time_sec, case_pp, continent, age, gdp2020, prior_attend) 

datsecond <- datsecond %>% 
  mutate(rank_b = ifelse(as.numeric(rank)<=37, 1, 0))

set.seed(1)
train.rows <- sample(rownames(datsecond), dim(datsecond)[1]*0.7)
test.rows <- setdiff(rownames(datsecond), train.rows)
train_set <- datsecond[train.rows, ]
test_set <- datsecond[test.rows, ]

opt_cut <- cutpointr(train_set, time_sec, rank_b, pos_class = 1, direction = "<=")  # 9339
plot_metric(opt_cut)

The two graphs above plot the sum of sensitivity and specificity (the sum_sens_spec metric) against candidate cutpoints, and we choose the cutpoint that gives the highest value on the y-axis. From this analysis, the optimal cutpoint is 8239 seconds for men and 9339 seconds for women. We use these cutpoints throughout the machine learning and bootstrapping models that follow.

Machine learning: model building & comparison (Mariko Ando)

After investigating the athletes’ performance by comparing 2016 with 2020, we also became interested in building models that classify athletes’ performance during the COVID-19 pandemic using 2020 Olympic data. The best such model could be used to classify athletes’ performance in future Olympic Games held during the pandemic, which would be of interest to athletes and others. For classifying performance (1: worse record, 0: better record), the models include COVID-19 severity, the athlete’s continent, age, the 2020 GDP of the athlete’s country, and prior attendance at Rio 2016. Because male athletes ran much faster than female athletes, we built separate models for men and women.

We compared six models (logistic regression, Naive Bayes, kNN, QDA, LDA, and decision trees) in terms of accuracy and discrimination (by AUC). Table 1 summarizes the model comparison for men, and Table 2 summarizes the comparison for women.
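The AUC used for discrimination below equals the probability that a randomly chosen positive receives a higher predicted score than a randomly chosen negative. A base-R sketch via the Mann-Whitney rank formula, on toy labels and scores:

```r
# AUC from the Mann-Whitney U statistic: rank all scores, sum the ranks of
# the positives, and normalize by the number of positive-negative pairs.
auc_rank <- function(labels, scores) {
  r <- rank(scores)
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}
auc_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))   # 0.75
```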

Analysis for Men’s records

dat=as.data.frame(dat)
# library
library(tidyverse)
library(rvest)
library(lubridate)
library(broom)
library(caret)
library(ggplot2)
library(e1071)
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:gtsummary':
## 
##     select
## The following object is masked from 'package:dplyr':
## 
##     select
library(rpart)
library(randomForest)
library(pROC)
library(adabag)
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## Loading required package: doParallel
## Loading required package: iterators
## Loading required package: parallel
library(splitstackshape)
library(knitr)
# filter: 2020, Men, finish race
# We created a dataset, datsecond, including only male athletes who finished the race in the Tokyo 2020 Games.
# As men ran much faster than women, we separately analyzed the male and female datasets.
datsecond <- dat %>% filter(dnf==0 & olympic=="Tokyo2020" & sex=="Men") %>% dplyr::select(time_sec, case_pp, continent, age, gdp2020, prior_attend) 
summary(datsecond)
##     time_sec       case_pp           continent              age       
##  Min.   :7718   Min.   :0.0000085   Length:76          Min.   :23.00  
##  1st Qu.:8108   1st Qu.:0.0067817   Class :character   1st Qu.:29.00  
##  Median :8252   Median :0.0465300   Mode  :character   Median :31.00  
##  Mean   :8324   Mean   :0.0515946                      Mean   :31.22  
##  3rd Qu.:8498   3rd Qu.:0.0918954                      3rd Qu.:33.25  
##  Max.   :9876   Max.   :0.1576326                      Max.   :44.00  
##                 NA's   :1                                             
##     gdp2020           prior_attend 
##  Min.   :1.845e+09   Min.   :0.00  
##  1st Qu.:2.064e+11   1st Qu.:0.00  
##  Median :5.153e+11   Median :0.00  
##  Mean   :2.483e+12   Mean   :0.25  
##  3rd Qu.:1.765e+12   3rd Qu.:0.25  
##  Max.   :2.094e+13   Max.   :1.00  
##  NA's   :5
# convert binary and categorical data as factor
datsecond <- datsecond %>% 
  mutate(continent=as.factor(continent), prior_attend=as.factor(prior_attend))

# omit rows with NA
# Here, we decided to conduct complete case analysis.
datsecond <- datsecond %>% drop_na()
summary(datsecond)
##     time_sec       case_pp                  continent       age       
##  Min.   :7718   Min.   :8.520e-06   Africa       :15   Min.   :24.00  
##  1st Qu.:8116   1st Qu.:1.108e-02   Asia         :12   1st Qu.:29.00  
##  Median :8307   Median :5.319e-02   Europe       :23   Median :31.00  
##  Mean   :8339   Mean   :5.220e-02   North America:11   Mean   :31.42  
##  3rd Qu.:8516   3rd Qu.:9.190e-02   Oceania      : 4   3rd Qu.:33.50  
##  Max.   :9876   Max.   :1.083e-01   South America: 6   Max.   :44.00  
##     gdp2020          prior_attend
##  Min.   :1.845e+09   0:52        
##  1st Qu.:2.064e+11   1:19        
##  Median :5.153e+11               
##  Mean   :2.483e+12               
##  3rd Qu.:1.765e+12               
##  Max.   :2.094e+13
# Based on the calculation by our team member Yi-Ting, we used 8239 seconds as the cutpoint for male athletes: time_sec < 8239 is a better record (outcome = 0) and time_sec >= 8239 is a worse record (outcome = 1).

cut <- 8239
datsecond<-datsecond%>%
  mutate(timebinary=ifelse(time_sec<cut,0,1)) %>%
  mutate(timebinary=as.factor(timebinary))
summary(datsecond)
##     time_sec       case_pp                  continent       age       
##  Min.   :7718   Min.   :8.520e-06   Africa       :15   Min.   :24.00  
##  1st Qu.:8116   1st Qu.:1.108e-02   Asia         :12   1st Qu.:29.00  
##  Median :8307   Median :5.319e-02   Europe       :23   Median :31.00  
##  Mean   :8339   Mean   :5.220e-02   North America:11   Mean   :31.42  
##  3rd Qu.:8516   3rd Qu.:9.190e-02   Oceania      : 4   3rd Qu.:33.50  
##  Max.   :9876   Max.   :1.083e-01   South America: 6   Max.   :44.00  
##     gdp2020          prior_attend timebinary
##  Min.   :1.845e+09   0:52         0:32      
##  1st Qu.:2.064e+11   1:19         1:39      
##  Median :5.153e+11                          
##  Mean   :2.483e+12                          
##  3rd Qu.:1.765e+12                          
##  Max.   :2.094e+13
# split train (70%), test (30%)
# We used the stratified() function to split the dataset into training and testing sets because we wanted a nearly equal outcome distribution in both. After splitting, the training set included 49 athletes and the testing set included 22 athletes.
set.seed(1)
x <- stratified(datsecond, "timebinary", 0.70, keep.rownames = TRUE)
train_set <- x %>% dplyr::select(-rn)
train_index <- as.numeric(x$rn)
test_set <- datsecond[-train_index,]
dim(train_set)
## [1] 49  7
dim(test_set)
## [1] 22  7
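The same stratified 70/30 split can be sketched in base R (the report uses splitstackshape::stratified; the class sizes 32/39 mirror the summary above): sample 70% of the row indices within each outcome level so both sets keep a similar class balance.

```r
set.seed(1)
y <- factor(rep(c(0, 1), times = c(32, 39)))     # outcome with the class sizes above
train_idx <- unlist(lapply(levels(y), function(lv) {
  idx <- which(y == lv)
  sample(idx, floor(0.70 * length(idx)))         # 70% within each class
}))
test_idx <- setdiff(seq_along(y), train_idx)
length(train_idx)   # 49
length(test_idx)    # 22
```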
# Here, we build six models to classify the worse record (timebinary = 1).
# The six models are logistic regression, Naive Bayes, kNN, QDA, LDA, and decision trees. Each model includes COVID-19 severity (continuous; case number per population), continent (categorical), age (continuous), gdp2020 (continuous), and prior_attend (binary).
# For each model, we report accuracy, sensitivity, and specificity.

# Model1: logistic regression
# accuracy = 0.5000, sensitivity = 0.5833, specificity = 0.4000 
glm_fit <- glm(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set, family = "binomial")
p_hat_logit<-predict(glm_fit, newdata=test_set, type="response")
y_hat_logit <- factor(ifelse(p_hat_logit > 0.5, 1, 0))
confusionMatrix(as.factor(y_hat_logit), reference = test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 5
##          1 6 7
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.2822, 0.7178)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.7404          
##                                           
##                   Kappa : -0.0168         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5385          
##          Neg Pred Value : 0.4444          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.4917          
##                                           
##        'Positive' Class : 1               
## 
# Model2: Naive Bayes
# accuracy = 0.5909, sensitivity = 0.7500, specificity = 0.4000
nb_fit <- naiveBayes(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_nb<-predict(nb_fit, newdata=test_set, type="raw")[,2]
y_hat_nb<-factor(ifelse(p_hat_nb>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_nb),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 3
##          1 6 9
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.3635, 0.7929)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.418           
##                                           
##                   Kappa : 0.1538          
##                                           
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.5714          
##              Prevalence : 0.5455          
##          Detection Rate : 0.4091          
##    Detection Prevalence : 0.6818          
##       Balanced Accuracy : 0.5750          
##                                           
##        'Positive' Class : 1               
## 
# Model3: knn
# For k, we chose the value closest to the square root of the training-set sample size: sqrt(49) = 7.
# accuracy =  0.5000, sensitivity = 0.5833, specificity = 0.4000 
knn_fit<-knn3(timebinary ~ case_pp + continent + age  + gdp2020 + prior_attend, data = train_set, k=7)
p_hat_knn<-predict(knn_fit,newdata=test_set)[,2]
y_hat_knn<-factor(ifelse(p_hat_knn>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_knn),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 5
##          1 6 7
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.2822, 0.7178)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.7404          
##                                           
##                   Kappa : -0.0168         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5385          
##          Neg Pred Value : 0.4444          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.4917          
##                                           
##        'Positive' Class : 1               
## 
# Model4: QDA 
# accuracy = 0.5000 , sensitivity = 0.6667, specificity = 0.3000
set.seed(1)
qda_fit <- qda(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_qda <- predict(qda_fit,newdata=test_set)$posterior[,2]
y_hat_qda <- factor(ifelse(p_hat_qda>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_qda),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 3 4
##          1 7 8
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.2822, 0.7178)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.7404          
##                                           
##                   Kappa : -0.0342         
##                                           
##  Mcnemar's Test P-Value : 0.5465          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.3000          
##          Pos Pred Value : 0.5333          
##          Neg Pred Value : 0.4286          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3636          
##    Detection Prevalence : 0.6818          
##       Balanced Accuracy : 0.4833          
##                                           
##        'Positive' Class : 1               
## 
# Model5: LDA
# accuracy = 0.5000, sensitivity = 0.5833, specificity = 0.4000
set.seed(1)
lda_fit <- lda(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_lda <- predict(lda_fit,newdata=test_set)$posterior[,2]
y_hat_lda <- factor(ifelse(p_hat_lda>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_lda),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 5
##          1 6 7
##                                           
##                Accuracy : 0.5             
##                  95% CI : (0.2822, 0.7178)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.7404          
##                                           
##                   Kappa : -0.0168         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5385          
##          Neg Pred Value : 0.4444          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.4917          
##                                           
##        'Positive' Class : 1               
## 
# Model6: Decision trees
# accuracy = 0.5455, sensitivity = 0.5833, specificity = 0.5000
set.seed(1)
tree_fit <- rpart(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_tree <- predict(tree_fit,newdata=test_set)[,2]
y_hat_tree <- factor(ifelse(p_hat_tree>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_tree),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 5 5
##          1 5 7
##                                           
##                Accuracy : 0.5455          
##                  95% CI : (0.3221, 0.7561)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.5869          
##                                           
##                   Kappa : 0.0833          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.5833          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5455          
##       Balanced Accuracy : 0.5417          
##                                           
##        'Positive' Class : 1               
## 
# plot ROC curve
# Here, we compare ROC curves and AUC to see which model has better discrimination.

roc_logit<-roc(test_set$timebinary,p_hat_logit)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc_nb<-roc(test_set$timebinary,p_hat_nb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_knn<-roc(test_set$timebinary,p_hat_knn)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_qda<-roc(test_set$timebinary,p_hat_qda)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc_lda<-roc(test_set$timebinary,p_hat_lda)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc_tree<-roc(test_set$timebinary,p_hat_tree)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(list("Logistic regression"=roc_logit,"Naive Bayes"=roc_nb,"kNN, k=7"=roc_knn, "QDA model"=roc_qda,"LDA model"=roc_lda, "Decision Tree"=roc_tree))+
  theme(legend.title=element_blank())+
  geom_segment(aes(x=1, xend=0,y=0,yend=1),color="black",linetype="dotted")+
  ggtitle("ROC curves") +
  xlab("Specificity")+
  ylab("Sensitivity")+ 
  theme(plot.title = element_text(hjust = 0.5))

#Naive Bayes model has the highest AUC.
auc(roc_logit)
## Area under the curve: 0.4917
auc(roc_nb)
## Area under the curve: 0.65
auc(roc_knn)
## Area under the curve: 0.4833
auc(roc_qda)
## Area under the curve: 0.5375
auc(roc_lda)
## Area under the curve: 0.5333
auc(roc_tree)
## Area under the curve: 0.5625
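As a cross-check on these AUC values, note that the AUC is equivalent to a rank statistic: it equals the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control (the Wilcoxon/Mann-Whitney statistic). A minimal base-R sketch with made-up scores (not the model outputs above):

```r
# AUC as the Wilcoxon/Mann-Whitney rank statistic:
# P(score of a random case > score of a random control)
auc_by_rank <- function(labels, scores) {
  r  <- rank(scores)       # midranks handle ties
  n1 <- sum(labels == 1)   # number of cases
  n0 <- sum(labels == 0)   # number of controls
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_by_rank(c(0, 0, 1, 1), c(0.1, 0.4, 0.35, 0.8))  # 0.75
```

Here 3 of the 4 case-control score pairs are correctly ordered, giving 0.75; an AUC of 0.5 corresponds to the dotted diagonal (no discrimination) in the ROC plot above.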
#Table for men's models
Men_models <- c("Logistic", "Naive Bayes", "kNN, k=7", "QDA", "LDA", "Trees")
Accuracy <- c("0.5000", "0.5909", "0.5000", "0.5000", "0.5000", "0.5455")
Sensitivity <- c("0.5833", "0.7500", "0.5833", "0.6667", "0.5833", "0.5833")
Specificity <- c("0.4000", "0.4000", "0.4000", "0.3000", "0.4000", "0.5000")
PPV <- c("0.5385", "0.6000", "0.5385", "0.5333", "0.5385", "0.5833")
NPV <- c("0.4444", "0.5714", "0.4444", "0.4286", "0.4444", "0.5000")
AUC <- c("0.4917", "0.6500", "0.4833", "0.5375", "0.5333", "0.5625")
male_df <- data.frame(Men_models, Accuracy, Sensitivity, Specificity, PPV, NPV, AUC)

male_df2 <- head(male_df)
knitr::kable(male_df2, col.names = gsub("[.]", " ", names(male_df)), caption = "Table 1: Model comparison for male athletes")
Table 1: Model comparison for male athletes
Men_models Accuracy Sensitivity Specificity PPV NPV AUC
Logistic 0.5000 0.5833 0.4000 0.5385 0.4444 0.4917
Naive Bayes 0.5909 0.7500 0.4000 0.6000 0.5714 0.6500
kNN, k=7 0.5000 0.5833 0.4000 0.5385 0.4444 0.4833
QDA 0.5000 0.6667 0.3000 0.5333 0.4286 0.5375
LDA 0.5000 0.5833 0.4000 0.5385 0.4444 0.5333
Trees 0.5455 0.5833 0.5000 0.5833 0.5000 0.5625
(Summary) According to these outputs, the Naive Bayes model showed the highest accuracy and the highest AUC (i.e., the best discrimination), so it could be the best model for classifying male athletes’ performance in future Olympic Games held during a similar pandemic. However, because the current dataset contains a very small number of athletes, we should ideally repeat these analyses with bootstrapping or with larger datasets before settling on a model.

Analysis for women’s records

# filter: 2020, Women, finish race
# We created a dataset, datsecondfemale, including only female athletes who finished the race in the Tokyo 2020 games.

datsecondfemale <- dat %>% filter(dnf==0 & olympic=="Tokyo2020" & sex=="Women") %>% dplyr::select(time_sec, case_pp, continent, age, gdp2020, prior_attend) 
summary(datsecondfemale)
##     time_sec        case_pp           continent              age       
##  Min.   : 8840   Min.   :8.520e-06   Length:73          Min.   :24.00  
##  1st Qu.: 9194   1st Qu.:6.782e-03   Class :character   1st Qu.:28.00  
##  Median : 9339   Median :4.632e-02   Mode  :character   Median :31.00  
##  Mean   : 9473   Mean   :5.392e-02                      Mean   :31.18  
##  3rd Qu.: 9604   3rd Qu.:9.088e-02                      3rd Qu.:34.00  
##  Max.   :10990   Max.   :1.576e-01                      Max.   :44.00  
##                                                                        
##     gdp2020           prior_attend   
##  Min.   :1.551e+09   Min.   :0.0000  
##  1st Qu.:1.103e+11   1st Qu.:0.0000  
##  Median :5.411e+11   Median :0.0000  
##  Mean   :2.164e+12   Mean   :0.1507  
##  3rd Qu.:1.644e+12   3rd Qu.:0.0000  
##  Max.   :2.094e+13   Max.   :1.0000  
##  NA's   :2
# convert binary and categorical data as factor
datsecondfemale <- datsecondfemale %>% 
  mutate(continent=as.factor(continent), prior_attend=as.factor(prior_attend))

# omit rows with NA
# Here, we again conduct a complete-case analysis.
datsecondfemale <- datsecondfemale %>% 
  filter(!is.na(time_sec), !is.na(case_pp), !is.na(continent),
         !is.na(age), !is.na(gdp2020), !is.na(prior_attend))
summary(datsecondfemale)
##     time_sec        case_pp                  continent       age       
##  Min.   : 8840   Min.   :8.520e-06   Africa       :11   Min.   :24.00  
##  1st Qu.: 9194   1st Qu.:6.782e-03   Asia         :13   1st Qu.:27.50  
##  Median : 9339   Median :4.632e-02   Europe       :30   Median :31.00  
##  Mean   : 9481   Mean   :5.319e-02   North America: 7   Mean   :31.23  
##  3rd Qu.: 9607   3rd Qu.:8.955e-02   Oceania      : 4   3rd Qu.:34.00  
##  Max.   :10990   Max.   :1.561e-01   South America: 6   Max.   :44.00  
##     gdp2020          prior_attend
##  Min.   :1.551e+09   0:60        
##  1st Qu.:1.103e+11   1:11        
##  Median :5.411e+11               
##  Mean   :2.164e+12               
##  3rd Qu.:1.644e+12               
##  Max.   :2.094e+13
# Based on the calculation by Yi-Ting, one member of our team, we used 9339 sec (the sample median finishing time) as the cut point for female athletes. We defined time_sec < 9339 as a better record (outcome = 0) and time_sec >= 9339 as a worse record (outcome = 1).

cut_female <- 9339
datsecondfemale<-datsecondfemale%>%
   mutate(timebinary=ifelse(time_sec<cut_female,0,1)) %>%
   mutate(timebinary=as.factor(timebinary))
summary(datsecondfemale)
##     time_sec        case_pp                  continent       age       
##  Min.   : 8840   Min.   :8.520e-06   Africa       :11   Min.   :24.00  
##  1st Qu.: 9194   1st Qu.:6.782e-03   Asia         :13   1st Qu.:27.50  
##  Median : 9339   Median :4.632e-02   Europe       :30   Median :31.00  
##  Mean   : 9481   Mean   :5.319e-02   North America: 7   Mean   :31.23  
##  3rd Qu.: 9607   3rd Qu.:8.955e-02   Oceania      : 4   3rd Qu.:34.00  
##  Max.   :10990   Max.   :1.561e-01   South America: 6   Max.   :44.00  
##     gdp2020          prior_attend timebinary
##  Min.   :1.551e+09   0:60         0:35      
##  1st Qu.:1.103e+11   1:11         1:36      
##  Median :5.411e+11                          
##  Mean   :2.164e+12                          
##  3rd Qu.:1.644e+12                          
##  Max.   :2.094e+13
# split train (70%), test (30%)
# We used the stratified() function to split the dataset into training and testing sets because we wanted a roughly equal outcome distribution in both. After splitting, the training set included 49 athletes and the testing set included 22.
set.seed(1)
x_female <- stratified(datsecondfemale, "timebinary", 0.70, keep.rownames = TRUE)
train_set_female <- x_female %>% dplyr::select(-rn)
train_index_female <- as.numeric(x_female$rn)
test_set_female <- datsecondfemale[-train_index_female,]
dim(train_set_female)
## [1] 49  7
dim(test_set_female)
## [1] 22  7
# Here, we build six models to predict the worse record: logistic regression, Naive Bayes, kNN, QDA, LDA, and decision trees.
# Each model includes COVID-19 severity (continuous; case number per population), continent (categorical), age (continuous), gdp2020 (continuous), and prior_attend (binary).
# In each model, we report accuracy, sensitivity, and specificity.
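For reference, all of the metrics reported below can be derived from the four cells of a 2×2 confusion matrix. A minimal base-R sketch with illustrative counts (hypothetical, not tied to any particular model):

```r
# hypothetical 2x2 counts: tp = predicted 1 & reference 1, etc.
tp <- 8; fp <- 6; fn <- 3; tn <- 5

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # fraction correct overall
sensitivity <- tp / (tp + fn)                   # true positive rate
specificity <- tn / (tn + fp)                   # true negative rate
ppv         <- tp / (tp + fp)                   # positive predictive value
npv         <- tn / (tn + fn)                   # negative predictive value

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv), 4)
```

These are the same quantities that confusionMatrix() reports for each model below.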

# Model1: logistic regression
# accuracy = 0.5909, sensitivity = 0.7273 , specificity = 0.4545 
glm_fit_female <- glm(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set_female, family = "binomial")
p_hat_logit_female<-predict(glm_fit_female, newdata=test_set_female, type="response")
y_hat_logit_female <- factor(ifelse(p_hat_logit_female > 0.5, 1, 0))
confusionMatrix(as.factor(y_hat_logit_female), reference = test_set_female$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 5 3
##          1 6 8
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.3635, 0.7929)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.2617          
##                                           
##                   Kappa : 0.1818          
##                                           
##  Mcnemar's Test P-Value : 0.5050          
##                                           
##             Sensitivity : 0.7273          
##             Specificity : 0.4545          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.6250          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3636          
##    Detection Prevalence : 0.6364          
##       Balanced Accuracy : 0.5909          
##                                           
##        'Positive' Class : 1               
## 
# Model2: Naive Bayes
# accuracy = 0.3636, sensitivity = 0.4545, specificity = 0.2727  
nb_fit_female <- naiveBayes(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set_female)
p_hat_nb_female<-predict(nb_fit_female, newdata=test_set_female, type="raw")[,2]
y_hat_nb_female<-factor(ifelse(p_hat_nb_female>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_nb_female),reference=test_set_female$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 3 6
##          1 8 5
##                                          
##                Accuracy : 0.3636         
##                  95% CI : (0.172, 0.5934)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : 0.9331         
##                                          
##                   Kappa : -0.2727        
##                                          
##  Mcnemar's Test P-Value : 0.7893         
##                                          
##             Sensitivity : 0.4545         
##             Specificity : 0.2727         
##          Pos Pred Value : 0.3846         
##          Neg Pred Value : 0.3333         
##              Prevalence : 0.5000         
##          Detection Rate : 0.2273         
##    Detection Prevalence : 0.5909         
##       Balanced Accuracy : 0.3636         
##                                          
##        'Positive' Class : 1              
## 
# Model3: knn
# For the value of k, we chose the number closest to the square root of the sample size of the training set, which was 7.
# accuracy = 0.7273, sensitivity = 0.6364, specificity = 0.8182
knn_fit_female<-knn3(timebinary ~ case_pp + continent + age  + gdp2020 + prior_attend, data = train_set_female, k=7)
p_hat_knn_female<-predict(knn_fit_female,newdata=test_set_female)[,2]
y_hat_knn_female<-factor(ifelse(p_hat_knn_female>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_knn_female),reference=test_set_female$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 9 4
##          1 2 7
##                                           
##                Accuracy : 0.7273          
##                  95% CI : (0.4978, 0.8927)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.02624         
##                                           
##                   Kappa : 0.4545          
##                                           
##  Mcnemar's Test P-Value : 0.68309         
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.7778          
##          Neg Pred Value : 0.6923          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.4091          
##       Balanced Accuracy : 0.7273          
##                                           
##        'Positive' Class : 1               
## 
# Model4: QDA  
# accuracy = 0.4091 , sensitivity = 0.4545, specificity = 0.3636
set.seed(1)
qda_fit_female <- qda(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set_female)
p_hat_qda_female <- predict(qda_fit_female,newdata=test_set_female)$posterior[,2]
y_hat_qda_female <- factor(ifelse(p_hat_qda_female>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_qda_female),reference=test_set_female$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 6
##          1 7 5
##                                           
##                Accuracy : 0.4091          
##                  95% CI : (0.2071, 0.6365)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.8569          
##                                           
##                   Kappa : -0.1818         
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.4545          
##             Specificity : 0.3636          
##          Pos Pred Value : 0.4167          
##          Neg Pred Value : 0.4000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2273          
##    Detection Prevalence : 0.5455          
##       Balanced Accuracy : 0.4091          
##                                           
##        'Positive' Class : 1               
## 
# Model5: LDA
# accuracy = 0.5455 , sensitivity = 0.7273, specificity = 0.3636
set.seed(1)
lda_fit_female <- lda(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set_female)
p_hat_lda_female <- predict(lda_fit_female,newdata=test_set_female)$posterior[,2]
y_hat_lda_female <- factor(ifelse(p_hat_lda_female>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_lda_female),reference=test_set_female$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 3
##          1 7 8
##                                           
##                Accuracy : 0.5455          
##                  95% CI : (0.3221, 0.7561)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.4159          
##                                           
##                   Kappa : 0.0909          
##                                           
##  Mcnemar's Test P-Value : 0.3428          
##                                           
##             Sensitivity : 0.7273          
##             Specificity : 0.3636          
##          Pos Pred Value : 0.5333          
##          Neg Pred Value : 0.5714          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3636          
##    Detection Prevalence : 0.6818          
##       Balanced Accuracy : 0.5455          
##                                           
##        'Positive' Class : 1               
## 
# Model6: Decision trees
# accuracy = 0.5909, sensitivity =  0.6364, specificity = 0.5455  
set.seed(1)
tree_fit_female <- rpart(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set_female)
p_hat_tree_female <- predict(tree_fit_female,newdata=test_set_female)[,2]
y_hat_tree_female <- factor(ifelse(p_hat_tree_female>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_tree_female),reference=test_set_female$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 6 4
##          1 5 7
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.3635, 0.7929)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.2617          
##                                           
##                   Kappa : 0.1818          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.5833          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5455          
##       Balanced Accuracy : 0.5909          
##                                           
##        'Positive' Class : 1               
## 
# plot ROC curve
# Here, we compare ROC curves and AUC to see which model has better discrimination.
roc_logit_female<-roc(test_set_female$timebinary,p_hat_logit_female)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_nb_female<-roc(test_set_female$timebinary,p_hat_nb_female)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc_knn_female<-roc(test_set_female$timebinary,p_hat_knn_female)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_qda_female<-roc(test_set_female$timebinary,p_hat_qda_female)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc_lda_female<-roc(test_set_female$timebinary,p_hat_lda_female)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_tree_female<-roc(test_set_female$timebinary,p_hat_tree_female)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
ggroc(list("Logistic regression"=roc_logit_female,"Naive Bayes"=roc_nb_female,"kNN, k=7"=roc_knn_female, "QDA model"=roc_qda_female, "LDA model"=roc_lda_female, "Decision Tree"=roc_tree_female))+
  theme(legend.title=element_blank())+
  geom_segment(aes(x=1, xend=0,y=0,yend=1),color="black",linetype="dotted")+
  ggtitle("ROC curves") +
  xlab("Specificity")+
  ylab("Sensitivity")+ 
  theme(plot.title = element_text(hjust = 0.5))

#kNN is the best model in terms of AUC in women, with Naive Bayes second.
auc(roc_logit_female)
## Area under the curve: 0.5868
auc(roc_nb_female)
## Area under the curve: 0.7107
auc(roc_knn_female)
## Area under the curve: 0.8017
auc(roc_qda_female)
## Area under the curve: 0.5868
auc(roc_lda_female)
## Area under the curve: 0.5702
auc(roc_tree_female)
## Area under the curve: 0.5826
#Table for women's models
Women_models <- c("Logistic", "Naive Bayes", "kNN, k=7", "QDA", "LDA", "Trees")
Accuracy <- c("0.5909", "0.3636", "0.7273", "0.4091", "0.5455", "0.5909")
Sensitivity <- c("0.7273", "0.4545", "0.6364", "0.4545", "0.7273", "0.6364")
Specificity <- c("0.4545", "0.2727", "0.8182", "0.3636", "0.3636", "0.5455")
PPV <- c("0.5714", "0.3846", "0.7778", "0.4167", "0.5333", "0.5833")
NPV <- c("0.6250", "0.3333", "0.6923", "0.4000", "0.5714", "0.6000")
AUC <- c("0.5868", "0.7107", "0.8017", "0.5868", "0.5702", "0.5826")
female_df <- data.frame(Women_models, Accuracy, Sensitivity, Specificity, PPV, NPV, AUC)

female_df2 <- head(female_df)
knitr::kable(female_df2, col.names = gsub("[.]", " ", names(female_df)), caption = "Table 2: Model comparison for female athletes")
Table 2: Model comparison for female athletes
Women_models Accuracy Sensitivity Specificity PPV NPV AUC
Logistic 0.5909 0.7273 0.4545 0.5714 0.6250 0.5868
Naive Bayes 0.3636 0.4545 0.2727 0.3846 0.3333 0.7107
kNN, k=7 0.7273 0.6364 0.8182 0.7778 0.6923 0.8017
QDA 0.4091 0.4545 0.3636 0.4167 0.4000 0.5868
LDA 0.5455 0.7273 0.3636 0.5333 0.5714 0.5702
Trees 0.5909 0.6364 0.5455 0.5833 0.6000 0.5826
(Summary) According to these outputs, the kNN model showed the highest accuracy and the highest AUC (i.e., the best discrimination), so it could be the best model for classifying female athletes’ performance in future Olympic Games held during a similar pandemic. However, because the current dataset contains a very small number of athletes, we should ideally repeat these analyses with bootstrapping or with larger datasets before settling on a model.

Bootstrapping (Yi-Ting Tsai)

Since the sample size of our dataset is very small (71 observations each for men and women), training machine learning models on it may be problematic. Therefore, we try methods that incorporate bootstrapping to see whether they yield better results.

We try to compare two sets of original vs. bootstrapped models:

  1. Decision tree and random forest

  2. Naive Bayes and the bootstrapped version of Naive Bayes

We chose to compare a decision tree with a random forest because random forest is a well-known model that builds in bootstrapping (bagging), and we wanted to see what difference the bootstrapping makes between the two. We chose Naive Bayes because it performed well among the models tried above (the best for men and the second best for women by AUC), and we wanted to know whether bootstrapping could improve its performance further.

For the random forest, we simply apply the randomForest() function in R. For Naive Bayes, we bootstrap manually: we sample with replacement from the training data 100 times, each time drawing a sample equal in size to the training set; for each resample we fit a model and obtain a predicted class for every test observation; we then take a majority vote across the 100 models to determine each observation’s predicted class.
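The resample-and-vote procedure just described can be sketched compactly on toy data. This is a base-R illustration only: the simulated data, variable names, and the glm() classifier (standing in for naiveBayes) are all made up for the sketch.

```r
set.seed(1)
# toy stand-in data: one predictor, binary outcome
n <- 80
x <- rnorm(n)
y <- factor(ifelse(x + rnorm(n) > 0, 1, 0))
train <- data.frame(x = x[1:56], y = y[1:56])
test  <- data.frame(x = x[57:n], y = y[57:n])

B <- 100
votes <- matrix(NA_real_, nrow = B, ncol = nrow(test))
for (b in 1:B) {
  # resample training rows with replacement, same size as the training set
  idx <- sample(nrow(train), size = nrow(train), replace = TRUE)
  fit <- glm(y ~ x, data = train[idx, ], family = "binomial")
  p   <- predict(fit, newdata = test, type = "response")
  votes[b, ] <- ifelse(p > 0.5, 1, 0)
}

# majority vote across the B bootstrap models
p_hat_boot <- colMeans(votes)
y_hat_boot <- factor(ifelse(p_hat_boot > 0.5, 1, 0))
mean(y_hat_boot == test$y)  # bagged test accuracy
```

The fraction of votes, p_hat_boot, doubles as a score for the ROC curves, exactly as in the analysis below.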

We first fit the models, then present the ROC curves and summary statistics at the end for comparison.

Bootstrapping for Men

# preparing the data
# filter: 2020, Men, finish race
datsecond <- dat %>% 
  filter(dnf==0 & olympic=="Tokyo2020" & sex=="Men") %>%
  dplyr::select(time_sec, case_pp, continent, age, gdp2020, prior_attend) 

# convert categorical/binary variables into factors
datsecond <- datsecond %>% 
  mutate(continent=as.factor(continent), prior_attend=as.factor(prior_attend))

# omit the rows with NA
datsecond <- datsecond %>% 
  filter(!is.na(time_sec), !is.na(case_pp), !is.na(continent),
         !is.na(age), !is.na(gdp2020), !is.na(prior_attend))

# add the binary outcome variable
cut <- 8239
datsecond<-datsecond%>%
  mutate(timebinary=ifelse(time_sec<cut,0,1)) %>%
  mutate(timebinary=as.factor(timebinary))

# split train (70%), test (30%)
set.seed(1)
x <- stratified(datsecond, "timebinary", 0.70, keep.rownames = TRUE)
train_set <- x %>% dplyr::select(-rn)
train_index <- as.numeric(x$rn)
test_set <- datsecond[-train_index,]

Decision Tree and Random Forest (Men)

# Decision trees
set.seed(1)
tree_fit <- rpart(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_tree <- predict(tree_fit,newdata=test_set)[,2]
y_hat_tree <- factor(ifelse(p_hat_tree>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_tree),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 5 5
##          1 5 7
##                                           
##                Accuracy : 0.5455          
##                  95% CI : (0.3221, 0.7561)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.5869          
##                                           
##                   Kappa : 0.0833          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5833          
##             Specificity : 0.5000          
##          Pos Pred Value : 0.5833          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5455          
##       Balanced Accuracy : 0.5417          
##                                           
##        'Positive' Class : 1               
## 
# Random Forest
set.seed(1)
rf_fit <- randomForest(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_rf <- predict(rf_fit,newdata=test_set,type="prob")[,2]
y_hat_rf <- factor(ifelse(p_hat_rf>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_rf),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 4
##          1 6 8
##                                           
##                Accuracy : 0.5455          
##                  95% CI : (0.3221, 0.7561)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.5869          
##                                           
##                   Kappa : 0.0678          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3636          
##    Detection Prevalence : 0.6364          
##       Balanced Accuracy : 0.5333          
##                                           
##        'Positive' Class : 1               
## 
# ROC curves
roc_tree <- pROC::roc(test_set$timebinary,p_hat_tree)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_rf <- pROC::roc(test_set$timebinary,p_hat_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
p1 <- ggroc(list("Decision Tree"=roc_tree, "Random Forest"=roc_rf))+
  theme(legend.title=element_blank())+
  geom_segment(aes(x=1, xend=0,y=0,yend=1),color="black",linetype="dotted")+
  ggtitle("ROC curves - tree (men)") +
  xlab("Specificity")+ylab("Sensitivity")+
  theme(plot.title = element_text(hjust = 0.5))

## AUC
pROC::auc(roc_tree)
## Area under the curve: 0.5625
pROC::auc(roc_rf)
## Area under the curve: 0.5292

Naive Bayes and the Bootstrapped Version (Men)

# Naive Bayes
nb_fit <- naiveBayes(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_nb<-predict(nb_fit, newdata=test_set, type="raw")[,2]
y_hat_nb<-factor(ifelse(p_hat_nb>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_nb),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 3
##          1 6 9
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.3635, 0.7929)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.418           
##                                           
##                   Kappa : 0.1538          
##                                           
##  Mcnemar's Test P-Value : 0.505           
##                                           
##             Sensitivity : 0.7500          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.5714          
##              Prevalence : 0.5455          
##          Detection Rate : 0.4091          
##    Detection Prevalence : 0.6818          
##       Balanced Accuracy : 0.5750          
##                                           
##        'Positive' Class : 1               
## 
# bootstrapped version of naive bayes
rows <- 1:nrow(train_set)
pred_class_table <- data.frame(matrix(ncol = nrow(test_set), nrow = 0))

for(iter in 1:100){
  # resample the training rows with replacement, same size as the training set
  sample_row <- sample(rows, size = nrow(train_set), replace = TRUE)
  resample_train_set <- train_set[sample_row, ]
  nb_fit<-naiveBayes(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = resample_train_set)
  p_hat_nb_b<-predict(nb_fit,newdata=test_set,type="raw")[,2]
  y_hat_nb_b<-ifelse(p_hat_nb_b>0.5,1,0)
  pred_class_table <- rbind(pred_class_table, y_hat_nb_b)
}

p_hat_nb_b <- apply(pred_class_table,2,sum)/100
y_hat_nb_b <- factor(ifelse(p_hat_nb_b > 0.5, 1, 0))

confusionMatrix(as.factor(y_hat_nb_b), reference = test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 4
##          1 6 8
##                                           
##                Accuracy : 0.5455          
##                  95% CI : (0.3221, 0.7561)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : 0.5869          
##                                           
##                   Kappa : 0.0678          
##                                           
##  Mcnemar's Test P-Value : 0.7518          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 0.4000          
##          Pos Pred Value : 0.5714          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.5455          
##          Detection Rate : 0.3636          
##    Detection Prevalence : 0.6364          
##       Balanced Accuracy : 0.5333          
##                                           
##        'Positive' Class : 1               
## 
# ROC curves
roc_nb <- pROC::roc(test_set$timebinary,p_hat_nb)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_nb_b <- pROC::roc(test_set$timebinary,p_hat_nb_b)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
p2 <- ggroc(list("nb"=roc_nb, "nb (bootstrapped)"=roc_nb_b))+
  theme(legend.title=element_blank())+
  geom_segment(aes(x=1, xend=0,y=0,yend=1),color="black",linetype="dotted")+
  ggtitle("ROC curves - Naive Bayes (men)") +
  xlab("Specificity")+ylab("Sensitivity")+
  theme(plot.title = element_text(hjust = 0.5))

## AUC
pROC::auc(roc_nb)
## Area under the curve: 0.65
pROC::auc(roc_nb_b)
## Area under the curve: 0.625

Bootstrapping for Women

# preparing the data
# filter: 2020, Women, finish race
datsecond <- dat %>% 
  filter(dnf==0 & olympic=="Tokyo2020" & sex=="Women") %>%
  dplyr::select(time_sec, case_pp, continent, age, gdp2020, prior_attend) 

# convert categorical/binary variables into factors
datsecond <- datsecond %>% 
  mutate(continent=as.factor(continent), prior_attend=as.factor(prior_attend))

# omit the rows with NA
datsecond <- datsecond %>% 
  filter(!is.na(time_sec), !is.na(case_pp), !is.na(continent),
         !is.na(age), !is.na(gdp2020), !is.na(prior_attend))

# add the binary outcome variable
cut <- 9339
datsecond<-datsecond%>%
  mutate(timebinary=ifelse(time_sec<cut,0,1)) %>%
  mutate(timebinary=as.factor(timebinary))

# split train (70%), test (30%)
set.seed(1)
x <- stratified(datsecond, "timebinary", 0.70, keep.rownames = TRUE)
train_set <- x %>% dplyr::select(-rn)
train_index <- as.numeric(x$rn)
test_set <- datsecond[-train_index,]

Decision Tree and Random Forest (Women)

# Decision trees
set.seed(1)
tree_fit <- rpart(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_tree <- predict(tree_fit,newdata=test_set)[,2]
y_hat_tree <- factor(ifelse(p_hat_tree>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_tree),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 6 4
##          1 5 7
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.3635, 0.7929)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.2617          
##                                           
##                   Kappa : 0.1818          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6364          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.5833          
##          Neg Pred Value : 0.6000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3182          
##    Detection Prevalence : 0.5455          
##       Balanced Accuracy : 0.5909          
##                                           
##        'Positive' Class : 1               
## 
# Random Forest
set.seed(1)
rf_fit <- randomForest(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_rf <- predict(rf_fit,newdata=test_set,type="prob")[,2]
y_hat_rf <- factor(ifelse(p_hat_rf>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_rf),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 7 5
##          1 4 6
##                                           
##                Accuracy : 0.5909          
##                  95% CI : (0.3635, 0.7929)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.2617          
##                                           
##                   Kappa : 0.1818          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.5455          
##             Specificity : 0.6364          
##          Pos Pred Value : 0.6000          
##          Neg Pred Value : 0.5833          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2727          
##    Detection Prevalence : 0.4545          
##       Balanced Accuracy : 0.5909          
##                                           
##        'Positive' Class : 1               
## 
# ROC curves
roc_tree <- pROC::roc(test_set$timebinary,p_hat_tree)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_rf <- pROC::roc(test_set$timebinary,p_hat_rf)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
p3 <- ggroc(list("Decision Tree"=roc_tree, "Random Forest"=roc_rf))+
  theme(legend.title=element_blank())+
  geom_segment(aes(x=1, xend=0,y=0,yend=1),color="black",linetype="dotted")+
  ggtitle("ROC curves - tree (women)") +
  xlab("Specificity")+ylab("Sensitivity")+
  theme(plot.title = element_text(hjust = 0.5))

## AUC
pROC::auc(roc_tree)
## Area under the curve: 0.5826
pROC::auc(roc_rf)
## Area under the curve: 0.6777

Naive Bayes and the Bootstrapped Version (Women)

# Naive Bayes
nb_fit <- naiveBayes(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = train_set)
p_hat_nb<-predict(nb_fit, newdata=test_set, type="raw")[,2]
y_hat_nb<-factor(ifelse(p_hat_nb>0.5,1,0))
confusionMatrix(data=as.factor(y_hat_nb),reference=test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 5
##          1 7 6
##                                           
##                Accuracy : 0.4545          
##                  95% CI : (0.2439, 0.6779)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : -0.0909         
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.5455          
##             Specificity : 0.3636          
##          Pos Pred Value : 0.4615          
##          Neg Pred Value : 0.4444          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2727          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.4545          
##                                           
##        'Positive' Class : 1               
## 
# bootstrapped version of Naive Bayes
# (note: sample(1:49, size = 85) reproduces the original run; a conventional
# bootstrap would instead draw nrow(train_set) indices from seq_len(nrow(train_set)))
rows <- 1:49
pred_class_table <- data.frame(matrix(ncol = nrow(test_set), nrow = 0))

for(iter in 1:100){
  # draw the resampled train set in one vectorized step
  sample_row <- sample(rows, size = 85, replace = TRUE)
  resample_train_set <- train_set[sample_row, ]
  nb_fit <- naiveBayes(timebinary ~ case_pp + continent + age + gdp2020 + prior_attend, data = resample_train_set)
  p_hat_nb_b <- predict(nb_fit, newdata = test_set, type = "raw")[,2]
  y_hat_nb_b <- ifelse(p_hat_nb_b > 0.5, 1, 0)
  pred_class_table <- rbind(pred_class_table, y_hat_nb_b)
}
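For reference, a conventional bootstrap draws as many rows as the training set contains, sampling indices with replacement in a single call; a base-R sketch on toy data:

```r
# One conventional bootstrap resample: draw n row indices from n rows,
# with replacement, in a single vectorized call (no rbind inside a loop).
set.seed(1)
toy_train <- data.frame(x = rnorm(50), y = factor(sample(0:1, 50, replace = TRUE)))
n <- nrow(toy_train)
boot_idx <- sample(seq_len(n), size = n, replace = TRUE)
resample_train <- toy_train[boot_idx, , drop = FALSE]
nrow(resample_train)  # 50, with some original rows repeated and others omitted
```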

p_hat_nb_b <- colMeans(pred_class_table)  # fraction of the 100 resamples voting class 1
y_hat_nb_b <- factor(ifelse(p_hat_nb_b > 0.5, 1, 0))

confusionMatrix(as.factor(y_hat_nb_b), reference = test_set$timebinary,positive="1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 4 5
##          1 7 6
##                                           
##                Accuracy : 0.4545          
##                  95% CI : (0.2439, 0.6779)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.7383          
##                                           
##                   Kappa : -0.0909         
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.5455          
##             Specificity : 0.3636          
##          Pos Pred Value : 0.4615          
##          Neg Pred Value : 0.4444          
##              Prevalence : 0.5000          
##          Detection Rate : 0.2727          
##    Detection Prevalence : 0.5909          
##       Balanced Accuracy : 0.4545          
##                                           
##        'Positive' Class : 1               
## 
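The loop above aggregates hard 0/1 votes; an alternative bagging variant averages the raw predicted probabilities across resamples, which preserves how confident each fitted model was. A toy sketch (the small matrix stands in for `pred_class_table` filled with probabilities rather than votes):

```r
# Bagging by averaging predicted probabilities instead of 0/1 class votes.
# Rows = bootstrap resamples, columns = test observations.
prob_table <- matrix(c(0.9, 0.6, 0.4,
                       0.8, 0.4, 0.2), nrow = 2, byrow = TRUE)
p_hat_avg <- colMeans(prob_table)            # mean probability per test case
y_hat_avg <- ifelse(p_hat_avg > 0.5, 1, 0)   # final bagged class
p_hat_avg  # 0.85 0.50 0.30
y_hat_avg  # 1 0 0
```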
# ROC curves
roc_nb <- pROC::roc(test_set$timebinary,p_hat_nb)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
roc_nb_b <- pROC::roc(test_set$timebinary,p_hat_nb_b)
## Setting levels: control = 0, case = 1
## Setting direction: controls > cases
p4 <- ggroc(list("nb"=roc_nb, "nb (bootstrapped)"=roc_nb_b))+
  theme(legend.title=element_blank())+
  geom_segment(aes(x=1, xend=0,y=0,yend=1),color="black",linetype="dotted")+
  ggtitle("ROC curves - Naive Bayes (women)") +
  xlab("Specificity")+ylab("Sensitivity")+
  theme(plot.title = element_text(hjust = 0.5))

## AUC
pROC::auc(roc_nb)
## Area under the curve: 0.5207
pROC::auc(roc_nb_b)
## Area under the curve: 0.4793

ROC Curves

The ROC curves comparing the decision tree with the random forest, and Naive Bayes with its bootstrapped version, for men and women respectively, are shown below.

For the women's decision tree versus random forest (the random forest being an aggregate of bootstrapped trees), the curve moves towards the upper-left corner after bootstrapping. In the other three comparisons, the ROC curves look very similar before and after bootstrapping.

p1

p2

p3

p4

Summary Statistics

Here, we list summary statistics from the confusion matrices, together with the AUCs, for the men's and women's models respectively.

# Table for men's models
Men_models <- c("Tree", "Random Forest", "NB", "NB (bootstrapped)")
Accuracy <- c("0.5455", "0.5455", "0.5909", "0.5455")
Sensitivity <- c("0.5833", "0.6667", "0.7500", "0.6667")
Specificity <- c("0.5000", "0.4000", "0.4000", "0.4000")
PPV <- c("0.5833", "0.5714", "0.6000", "0.5714")
NPV <- c("0.5000", "0.5000", "0.5714", "0.5000")
AUC <- c("0.5625", "0.5458", "0.65", "0.6167")

male_df <- data.frame(Men_models, Accuracy, Sensitivity, Specificity, PPV, NPV, AUC)

kable(male_df, format = "markdown", digits = 2)
| Men_models        | Accuracy | Sensitivity | Specificity | PPV    | NPV    | AUC    |
|:------------------|---------:|------------:|------------:|-------:|-------:|-------:|
| Tree              | 0.5455   | 0.5833      | 0.5000      | 0.5833 | 0.5000 | 0.5625 |
| Random Forest     | 0.5455   | 0.6667      | 0.4000      | 0.5714 | 0.5000 | 0.5458 |
| NB                | 0.5909   | 0.7500      | 0.4000      | 0.6000 | 0.5714 | 0.6500 |
| NB (bootstrapped) | 0.5455   | 0.6667      | 0.4000      | 0.5714 | 0.5000 | 0.6167 |
# Table for women's models
Women_models <- c("Tree", "Random Forest", "NB", "NB (bootstrapped)")
Accuracy <- c("0.5909", "0.5909", "0.4545", "0.4545")
Sensitivity <- c("0.6364", "0.5455", "0.5455", "0.5455")
Specificity <- c("0.5455", "0.6364", "0.3636", "0.3636")
PPV <- c("0.5833", "0.6000", "0.4615", "0.4615")
NPV <- c("0.6000", "0.5833", "0.4444", "0.4444")
AUC <- c("0.5826", "0.6777", "0.5207", "0.4793")

female_df <- data.frame(Women_models, Accuracy, Sensitivity, Specificity, PPV, NPV, AUC)

kable(female_df, format = "markdown", digits = 2)
| Women_models      | Accuracy | Sensitivity | Specificity | PPV    | NPV    | AUC    |
|:------------------|---------:|------------:|------------:|-------:|-------:|-------:|
| Tree              | 0.5909   | 0.6364      | 0.5455      | 0.5833 | 0.6000 | 0.5826 |
| Random Forest     | 0.5909   | 0.5455      | 0.6364      | 0.6000 | 0.5833 | 0.6777 |
| NB                | 0.4545   | 0.5455      | 0.3636      | 0.4615 | 0.4444 | 0.5207 |
| NB (bootstrapped) | 0.4545   | 0.5455      | 0.3636      | 0.4615 | 0.4444 | 0.4793 |

Discussion and Comparison

We can now compare the results of the decision tree with the random forest, and of Naive Bayes with its bootstrapped version. Judging by AUC, bootstrapping helps in only one of the four comparisons: the women's decision tree versus random forest, where the AUC rises from 0.5826 to 0.6777. In the other three cases, bootstrapping either leaves the performance essentially unchanged or makes it slightly worse.

This result differs from our expectation, since we had hoped that bootstrapping would improve the performance of our classification models. The small sample size is a likely explanation, and the comparison might look different with a larger dataset.
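One way to check whether these AUC differences are meaningful on a test set of only ~22 runners is to bootstrap the AUC itself and look at the width of its confidence interval. A self-contained base-R sketch using the Mann-Whitney form of the AUC (the toy scores stand in for, e.g., `p_hat_rf`):

```r
# Percentile bootstrap CI for the AUC on a small test set. auc() below is the
# Mann-Whitney statistic: the probability a random case outscores a random control.
set.seed(1)
y <- rep(c(0, 1), each = 11)          # 22 test labels, as in our test sets
s <- c(rnorm(11, 0), rnorm(11, 0.8))  # toy predicted scores
auc <- function(y, s) {
  mean(outer(s[y == 1], s[y == 0], ">") + 0.5 * outer(s[y == 1], s[y == 0], "=="))
}
boot_auc <- replicate(2000, { i <- sample(22, replace = TRUE); auc(y[i], s[i]) })
quantile(boot_auc, c(0.025, 0.975), na.rm = TRUE)  # a wide interval means a noisy point AUC
```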